COMPUTER ARCHITECTURE
AND PARALLEL PROCESSING
McGraw-Hill Series in Computer Organization and Architecture
McGraw-Hill Computer Science Series
Ahuja: Design and Analysis of Computer Communication Networks
Barbacci and Siewiorek: The Design and Analysis of Instruction Set Processors
Cavanagh: Digital Computer Arithmetic: Design and Implementation
Ceri and Pelagatti: Distributed Databases: Principles and Systems
Donovan: Systems Programming
Filman and Friedman: Coordinated Computing: Tools and Techniques for Distributed Software
Givone: Introduction to Switching Circuit Theory
Goodman and Hedetniemi: Introduction to the Design and Analysis of Algorithms
Katzan: Microprogramming Primer
Keller: A First Course in Computer Programming Using Pascal
Kohavi: Switching and Finite Automata Theory
Liu: Elements of Discrete Mathematics
Liu: Introduction to Combinatorial Mathematics
MacEwen: Introduction to Computer Systems: Using the PDP-11 and Pascal
Madnick and Donovan: Operating Systems
Manna: Mathematical Theory of Computation
Newman and Sproull: Principles of Interactive Computer Graphics
Payne: Introduction to Simulation: Programming Techniques and Methods of Analysis
Révész: Introduction to Formal Languages
Rice: Matrix Computations and Mathematical Software
Salton and McGill: Introduction to Modern Information Retrieval
Shooman: Software Engineering: Design, Reliability, and Management
Tremblay and Bunt: An Introduction to Computer Science: An Algorithmic Approach
Tremblay and Bunt: An Introduction to Computer Science: An Algorithmic Approach,
Short Edition
Tremblay and Manohar: Discrete Mathematical Structures with Applications to Computer
Science
Tremblay and Sorenson: An Introduction to Data Structures with Applications
Tucker: Programming Languages
Wiederhold: Database Design
Wulf, Levin, and Harbison: Hydra/C.mmp: An Experimental Computer System
COMPUTER
ARCHITECTURE
AND
PARALLEL
PROCESSING
Kai Hwang
University of Southern California
Fayé A. Briggs
Rice University
This book was set in Times Roman.
The editors were Eric M. Munson and Jonathan Palace;
the production supervisor was Leroy A. Young.
The drawings were done by ANCO/Boston.
Halliday Lithograph Corporation was printer and binder.
ISBN 0-07-031556-6
Hwang, Kai.
Computer architecture and parallel processing.
To my parents,
Hwang Yuan-Chung and Liu Cheng Fong,
my wife, Pu Fong,
and my sons, Tony and Andy.
Kai Hwang
To my grandparents,
Rev. P. B. Harry and Mrs. B. P. Harry.
Fayé A. Briggs
Preface xv
2.2.1 The Concept of Virtual Memory
2.2.2 Paged Memory System
2.2.3 Segmented Memory System
2.2.4 Memory with Paged Segments
2.3 Memory Allocation and Management
2.3.1 Classification of Memory Policies
2.3.2 Optimal Load Control
2.3.3 Memory Management Policies
2.4 Cache Memories and Management
2.4.1 Characteristics of Cache Memories
2.4.2 Cache Memory Organizations
2.4.3 Fetch and Main Memory Update Policies
2.4.4 Block Replacement Policies
2.5 Input-Output Subsystems
2.5.1 Characteristics of I/O Subsystems
2.5.2 Interrupt Mechanisms and Special Hardware
2.5.3 I/O Processors and I/O Channels
2.6 Bibliographic Notes and Problems
Bibliography 813
Index 833
PREFACE
language. Sections marked with an asterisk (*) are research-oriented topics. Readers
are expected to have some background on discrete mathematics and probability
theory in studying these research topics. These difficult sections can be skipped in
the first reading without loss of continuity. Homework problems are essential to
provide readers with in-depth thinking and hands-on experience in the design,
application, and evaluation of parallel computers.
Parallel processing and computer architecture are two wide-open areas for
research and development. We hope that this book will inspire further advances in
these frontier computer areas. Bibliographic notes are attached at the end of each
chapter to help interested readers find additional references for extended studies.
The authors are fully responsible for any errors or omissions. We apologize to
those computer specialists whose original works are not included in this volume.
The computer area is changing so rapidly that no book can cover every new
development. However, we do welcome inputs and criticisms from our
readers. Readers are invited to send their comments directly to the authors, so
that improvement can be made in future printings or revisions of the book.
This book can be used as a text when offering a sequence of two courses on
computer architecture and parallel processing. Each course contains 45 lectures,
each 50 minutes long. We suggest the following materials be covered in the first
course of a two-course sequence. The remaining sections are reserved for the
second course.
(A table here lists, for each of Chapters 1 through 10, the sections suggested for the first course.)
The first course is suitable for senior and first-year graduate students. The
second course is mainly for graduate students. The first course is a prerequisite for
the second course. If the book is adopted for only one course offering, the instructor
can move some sections from the second course to the first one to give more
complete coverage of some selected topics which are of special interest to the
instructor and students. This may necessitate trading some sections listed above
with the added sections from the second course. A Solutions Manual to this book
will be available from McGraw-Hill for instructors only. The manual contains
solutions to all problems plus a number of design projects suitable for use as term
Kai Hwang
Fayé A. Briggs
CHAPTER
ONE
INTRODUCTION TO PARALLEL PROCESSING
Over the past four decades the computer industry has experienced four generations
of development, physically marked by the rapid changing of building blocks from
relays and vacuum tubes (1940-1950s) to discrete diodes and transistors (1950-
1960s), to small- and medium-scale integrated (SSI/MSI) circuits (1960-1970s),
and to large- and very-large-scale integrated (LSI/VLSI) devices (1970s and
beyond). Increases in device speed and reliability and reductions in hardware
cost and physical size have greatly enhanced computer performance. However,
better devices are not the sole factor contributing to high performance. Ever since
the stored-program concept of von Neumann, the computer has been recognized
as more than just a hardware organization problem. A modern computer system is
really a composite of such items as processors, memories, functional units, inter-
connection networks, compilers, operating systems, peripheral devices, communica-
tion channels, and database banks.
To design a powerful and cost-effective computer system and to devise efficient
programs to solve a computational problem, one must understand the underlying
The first generation (1938-1953) The introduction of the first electronic analog
computer in 1938 and the first electronic digital computer, ENIAC (Electronic
Numerical Integrator and Computer), in 1946 marked the beginning of the first
generation of computers. Electromechanical relays were used as switching devices
Figure 1.1 The evolution of computer systems.
INTRODUCTION TO PARALLEL PROCESSING 3
in the 1940s, and vacuum tubes were used in the 1950s. These devices were inter-
connected by insulated wires. Hardware components were expensive then, which
forced the CPU structure to be bit-serial: arithmetic is done on a bit-by-bit
fixed-point basis, as in a ripple-carry addition which uses a single full adder and
one bit of carry flag.
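The bit-serial style can be pictured with a small sketch, not taken from the text: a single full adder and a one-bit carry flag are reused once per bit position, which is why an addition took as many steps as there are bits in a word.

```python
def full_adder(a, b, carry_in):
    """One-bit full adder: returns (sum_bit, carry_out)."""
    s = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return s, carry_out

def bit_serial_add(x, y, width=8):
    """Add two unsigned integers one bit per step, reusing one full adder
    and one carry flag, in the manner of a first-generation serial ALU."""
    carry = 0
    result = 0
    for i in range(width):                 # one step per bit position
        a = (x >> i) & 1
        b = (y >> i) & 1
        s, carry = full_adder(a, b, carry)
        result |= s << i
    return result

print(bit_serial_add(23, 42))   # 65, produced in 8 serial steps
```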
Only binary-coded machine language was used in early computers. In 1950,
the first stored-program computer, EDVAC (Electronic Discrete Variable
Automatic Computer), was developed. This marked the beginning of the use of
system software to relieve the user’s burden in low-level programming. However,
it is not difficult to imagine that hardware costs predominated and software-
language features were rather primitive in the early computers. By 1952, IBM had
announced its 701 electronic calculator. The system used Williams’ tube memory,
magnetic drums, and magnetic tape.
The second generation (1952-1963) Transistors were invented in 1948. The first
transistorized digital computer, TRADIC, was built by Bell Laboratories in 1954.
Discrete transistors and diodes were the building blocks: 800 transistors were
used in TRADIC. Printed circuits appeared. By this time, coincident current
magnetic core memory was developed and subsequently appeared in many
machines. Assembly languages were used until the development of high-level
languages, Fortran (formula translation) in 1956 and Algol (algorithmic language)
in 1960.
In 1959, Sperry Rand built the Larc system and IBM started the Stretch
project. These were the first two computers attributable to architectural improve-
ment. The Larc had an independent I/O processor which operated in parallel with
one or two processing units. Stretch featured instruction lookahead and error
correction, to be discussed in Section 1.2. The first IBM scientific, transistorized
computer, IBM 1620, became available in 1960. Cobol (common business oriented
language) was developed in 1959. Interchangeable disk packs were introduced
in 1963. Batch processing was popular, providing sequential execution of user
programs, one at a time until done.
The third generation (1962-1975) This generation was marked by the use of
small-scale integrated (SSI) and medium-scale integrated (MSI) circuits as the
basic building blocks. Multilayered printed circuits were used. Core memory was
still used in CDC-6600 and other machines but, by 1968, many fast computers,
like CDC-7600, began to replace cores with solid-state memories. High-level
languages were greatly enhanced with intelligent compilers during this period.
Multiprogramming was well developed to allow the simultaneous execution of
many program segments interleaved with I/O operations. Many high-performance
computers, like IBM 360/91, Illiac IV, TI-ASC, Cyber-175, STAR-100,
and C.mmp,
and several vector processors were developed in the early seventies. Time-sharing
operating systems became available in the late 1960s. Virtual memory was de-
veloped by using hierarchically structured memory systems.
The future Computers to be used in the 1990s may be the next generation. Very-
large-scale integrated (VLSI) chips will be used along with high-density modular
design. Multiprocessors like the 16 processors in the S-1 project at Lawrence
Livermore National Laboratory and in Denelcor's HEP will be required.
Cray-2 is expected to have four processors, to be delivered in 1985. More than 1000
million floating-point operations per second (megaflops) are expected in these future
supercomputers. We will study major existing systems and discuss possible future
machines in subsequent chapters.
• Data processing
• Information processing
• Knowledge processing
• Intelligence processing
Figure 1.2 The spaces of data, information, knowledge, and intelligence from the viewpoint of computer
processing.
• Batch processing
• Multiprogramming
• Time sharing
• Multiprocessing
In these four operating modes, the degree of parallelism increases sharply from
phase to phase. The general trend is to emphasize parallel processing of information.
In what follows, the term information is used with an extended meaning to include
data, information, knowledge, and intelligence. We formally define parallel
processing as follows:
(Figure: the VAX-11/780 uniprocessor, with a console and floppy disk, a CPU containing general registers and the ALU, a main memory of 32-bit words, the synchronous backplane interconnect (SBI), and Unibus and Massbus adapters.)
Figure 1.4 The system architecture of the mainframe IBM System 370/Model 168 uniprocessor computer
(Courtesy of International Business Machines Corp.).
(Figure 1.5: the CDC-6600, with ten peripheral I/O processors and a central processor containing multiple functional units such as add, multiply, divide, and increment.)
arithmetic, and the other for floating-point arithmetic. Within the floating-point
E unit are two functional units: one for floating-point add-subtract and the other
for floating-point multiply-divide. IBM 360/91 is a highly pipelined, multifunction,
scientific uniprocessor. We will study 360/91 in detail in Chapter 3. Almost all
modern computers and attached processors are equipped with multiple functional
units to perform parallel or simultaneous arithmetic logic operations. This practice
of functional specialization and distribution can be extended to array processors
and multiprocessors, to be discussed in subsequent chapters.
Parallelism and pipelining within the CPU Parallel adders, using such techniques
as carry-lookahead and carry-save, are now built into almost all ALUs. This is in
contrast to the bit-serial adders used in the first-generation machines. High-speed
multiplier recoding and convergence division are techniques for exploring
parallelism and the sharing of hardware resources for the functions of multiply
and divide (to be described in Section 3.2.2). The use of multiple functional units
is a form of parallelism within the CPU.
Various phases of instruction executions are now pipelined, including instruc-
tion fetch, decode, operand fetch, arithmetic logic execution, and store result. To
facilitate overlapped instruction executions through the pipe, instruction prefetch
and data buffering techniques have been developed. Instruction and arithmetic
pipeline designs will be covered in Chapters 3 and 4. Most commercial uniprocessor
systems are now pipelined in their CPU with a clock rate between 10 and 500 ns.
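The overlap among the fetch, decode, operand-fetch, and execution phases can be visualized with a small space-time sketch. This is an illustrative model only: the four stage names follow the text, but the scheduling is idealized, with no memory conflicts, data dependencies, or branches.

```python
STAGES = ["IF", "ID", "OF", "EX"]   # instruction fetch, decode, operand fetch, execute

def space_time(num_instructions):
    """Return {(stage, cycle): instruction} for an ideal linear pipeline:
    instruction i enters stage s at cycle i + s (0-based)."""
    table = {}
    for i in range(num_instructions):
        for s, stage in enumerate(STAGES):
            table[(stage, i + s)] = f"I{i + 1}"
    return table

table = space_time(5)
total_cycles = 5 + len(STAGES) - 1          # n + k - 1 cycles for n instructions
for stage in reversed(STAGES):              # print the space-time diagram
    row = [table.get((stage, c), "") for c in range(total_cycles)]
    print(f"{stage}: " + " ".join(f"{x:>3}" for x in row))
# Without overlap the same 5 instructions would need 5 * 4 = 20 cycles.
```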
Overlapped CPU and I/O operations I/O operations can be performed simul-
taneously with the CPU computations by using separate I/O controllers, channels,
or I/O processors. The direct-memory-access (DMA) channel can be used to
provide direct information transfer between the I/O devices and the main memory.
The DMA is conducted on a cycle-stealing basis, which is transparent to the CPU.
Furthermore, I/O multiprocessing, such as the use of the 10 I/O processors in
CDC-6600 (Figure 1.5), can speed up data transfer between the CPU (or memory)
and the outside world. I/O subsystems for supporting parallel processing will be
described in Section 2.5. Back-end database machines can be used to manage large
databases stored on disks.
Use of hierarchical memory system Usually, the CPU is about 1000 times faster
than memory access. A hierarchical memory system can be used to close up the
speed gap. Computer memory hierarchy is conceptually illustrated in Figure 1.6.
The innermost level is the register files directly addressable by the ALU. Cache memory
can be used to serve as a buffer between the CPU and the main memory. Block
access of the main memory can be achieved through multiway interleaving across
parallel memory modules (see Figure 1.4). Virtual memory space can be established
with the use of disks and tape units at the outer levels.
Details of memory subsystems for both uniprocessor and multiprocessor
computers are given in Chapter 2. Various interleaved memory organizations are
given in Section 3.1.4. Parallel memories for array processors are treated in
(Figure 1.6: the conceptual memory hierarchy, from CPU registers and cache through main memory (RAMs or core) to disks and tapes at the outer levels.)
Section 6.2.4, along with the description of the Burroughs Scientific Processor
(1978). Multiprocessor memory and cache coherence problems will be treated in
Section 7.3. All these techniques are intended to broaden the memory bandwidth
to match that of the CPU.
For example, the IBM 370/168 has t_d = 5 ms (disk), t_m = 320 ns, and t_p = 80 ns.
With these speed gaps between the subsystems, we need to match their processing
bandwidths in order to avoid a system bottleneck problem.
The bandwidth of a system is defined as the number of operations performed
per unit time. In the case of a main memory system, the memory bandwidth is
measured by the number of memory words that can be accessed (either fetch or
store) per unit time. Let W be the number of words delivered per memory cycle t_m.
Then the maximum memory bandwidth B_m is equal to

B_m = W / t_m    (1.2)
For example, the IBM 3033 uniprocessor has a processor cycle t_p = 57 ns. Eight
double words (8 bytes each) can be requested from an eight-way interleaved
memory system (with eight LSEs in Figure 1.7) per each memory cycle t_m =
456 ns. Thus, the maximum memory bandwidth of the 3033 is B_m = 8 × 8 bytes/456
ns = 140 megabytes/s. Memory access conflicts may cause delayed access of some
of the processor requests. In practice, the utilized memory bandwidth B_m^u is usually
lower than B_m; that is, B_m^u ≤ B_m. A rough measure of B_m^u has been suggested as
B_m^u = B_m / √M    (1.3)

where M is the number of interleaved memory modules. The utilized CPU rate B_p^u can be
measured similarly as B_p^u = R_w / T_w,
where R_w is the number of word results and T_w is the total CPU time required to
generate the R_w results. For a machine with variable word length, the rate will
vary. For example, the CDC Cyber-205 has a peak CPU rate of 200 megaflops for
Figure 1.7 The interleaved memory structure in the IBM 3033 uniprocessor.
32-bit results and only 100 megaflops for 64-bit results (one vector processor is
assumed).
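The IBM 3033 bandwidth figure quoted above follows directly from the definition of B_m (bytes delivered per memory cycle). A small sketch using only the numbers given in the text:

```python
def memory_bandwidth(words_per_cycle, bytes_per_word, cycle_time_s):
    """Maximum memory bandwidth B_m: bytes delivered per memory cycle / cycle time."""
    return words_per_cycle * bytes_per_word / cycle_time_s

# IBM 3033 example from the text: 8 double words (8 bytes each) per 456-ns cycle.
b_m = memory_bandwidth(words_per_cycle=8, bytes_per_word=8, cycle_time_s=456e-9)
print(f"B_m = {b_m / 1e6:.0f} megabytes/s")   # about 140 megabytes/s
```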
Based on current technology (1983), the following relationships have been
observed between the bandwidths of the major subsystems in a high-performance
uniprocessor:
Bandwidth balancing between CPU and memory The speed gap between the CPU
and the main memory can be closed up by using fast cache memory between them.
The cache should have an access time t_c ≈ t_p. A block of memory words is moved
from the main memory into the cache (such as 16 words/block for the IBM 3033)
so that immediate instructions/data can be available most of the time from the
cache. The cache serves as a data/instruction buffer. Detailed descriptions of
cache memories will be given in Sections 2.4 and 7.3.
Total balance among the subsystem bandwidths is achieved when

B_d^u = B_m^u = B_p^u    (1.6)

where B_d^u = B_d and B_m^u = B_m are both maximized. Achieving this total balance
requires tremendous hardware and software supports beyond any of the existing
systems.
(Figure: bandwidth relationships among the device I/O system, main memory, and CPU; peripheral devices and a back-end database machine attach through device controllers and I/O channels, the main memory modules are interleaved, multiplexing and buffering occur between levels, and the CPU contains registers and a cache.)
Multiprogramming Within the same time interval, there may be multiple processes
active in a computer, competing for memory, I/O, and CPU resources. We are
aware of the fact that some computer programs are CPU-bound (computation
intensive), and some are I/O-bound (input-output intensive). We can mix the
execution of various types of programs in the computer to balance bandwidths
among the various functional units. The program interleaving is intended to
promote better resource utilization through overlapping I/O and CPU operations.
As illustrated in Figure 1.9, whenever a process P1 is tied up with I/O opera-
tions, the system scheduler can switch the CPU to process P2. This allows the
simultaneous execution of several programs in the system. When P2 is done,
the CPU can be switched to P3. Note that the I/O and CPU operations are overlapped
and the CPU wait time is greatly reduced. This interleaving of CPU and I/O opera-
tions among several programs is called multiprogramming. The programs can be
mixed across the boundary of user tasks and system processes, in either a mono-
programming or a multiprogramming environment. The total execution time is
reduced with multiprogramming. The processes P1, P2, ... may belong to the
same or different programs.
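The benefit of interleaving CPU-bound and I/O-bound work can be sketched with a toy schedule. In the hypothetical model below, each process runs one compute burst followed by one I/O burst, all processes are ready at time zero, the single CPU is the only serialized resource, and I/O proceeds in parallel once started; the burst lengths are made up for illustration.

```python
# Each process: (compute_time, io_time), executed as compute then I/O.
processes = [(4, 6), (5, 3), (2, 7)]

# Monoprogramming: every burst runs strictly one after another.
serial_time = sum(c + io for c, io in processes)

# Multiprogramming (idealized): CPU bursts are serialized back to back,
# while each process's I/O burst overlaps later CPU work.
cpu_free = 0
finish = 0
for compute, io in processes:
    cpu_free += compute                    # CPU runs this burst when it is free
    finish = max(finish, cpu_free + io)    # its I/O then proceeds in parallel

print("serial:", serial_time, "overlapped:", finish)   # serial: 27  overlapped: 18
```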
(Figure 1.9: timing diagrams comparing batch processing, multiprogramming, and time-shared processing for processes P1, P2, ..., with compute (c) and output (o) phases, CPU and I/O phases marked, and the time saved by overlapping indicated.)
1.3 PARALLEL COMPUTER STRUCTURES
Parallel computers are those systems that emphasize parallel processing. The basic
architectural features of parallel computers are introduced below. We divide
parallel computers into three architectural configurations:

• Pipeline computers
• Array processors
• Multiprocessor systems
(Figure: space-time diagrams contrasting a nonpipelined processor with a four-stage pipelined processor; the stages are IF (instruction fetch), ID (instruction decode), OF (operand fetch), and EX (execution), and successive instructions I1, I2, ... overlap in the pipelined case.)
cycle. The instruction cycle has been effectively reduced to one-fourth of the
original cycle time by such overlapped execution.
Theoretically, a k-stage linear pipeline processor could be at most k times
faster. We will prove this in Chapter 3. However, due to memory conflicts, data
dependency, branching, and interrupts, this ideal speedup may not be achieved for
out-of-sequence computations. What has been described so far is the instruction
pipeline. For some CPU-bound instructions, the execution phase can be further
partitioned into a multiple-stage arithmetic logic pipeline, as for sophisticated
(Figure: functional structure of a pipelined scalar/vector computer, with instruction fetch and decode from main memory, scalar registers feeding scalar pipelines, scalar data paths in the scalar processor, and a vector fetch path into the vector processor.)
Figure 1.12 Functional structure of an SIMD array processor with concurrent scalar processing in the
control unit.
array processor is depicted in Figure 1.12. Scalar and control-type instructions are
directly executed in the control unit (CU). Each PE consists of an ALU with registers
and a local memory. The PEs are interconnected by a data-routing network. The
interconnection pattern to be established for specific computation is under program
control from the CU. Vector instructions are broadcast to the PEs for distributed
execution over different component operands fetched directly from the local
memories. Instruction fetch (from local memories or from the control memory)
and decode is done by the control unit. The PEs are passive devices without in-
struction decoding capabilities.
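A lock-step SIMD step can be mimicked in software: one broadcast instruction is applied by every processing element to the operand held in its own local memory. The sketch below is illustrative only; the PE count and operation are not tied to any machine in the text.

```python
# Each PE holds one component of each vector operand in its local memory.
local_a = [1, 2, 3, 4, 5, 6, 7, 8]             # vector A, one element per PE
local_b = [10, 20, 30, 40, 50, 60, 70, 80]     # vector B, one element per PE

def broadcast(op, a_mem, b_mem):
    """The control unit broadcasts a single operation; all PEs apply it
    in lock step to their own local operands."""
    return [op(a, b) for a, b in zip(a_mem, b_mem)]

local_c = broadcast(lambda a, b: a + b, local_a, local_b)   # one vector-add instruction
print(local_c)   # [11, 22, 33, 44, 55, 66, 77, 88]
```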
These organizations and their possible extensions for multiprocessor systems will
be described in detail in Chapter 7. Techniques for exploiting concurrency in
multiprocessors will be studied, including the development of some parallel
language features and the possible detection of parallelism in user programs.
Special memory organization for multiprocessors will be treated in Section 7.3.
We will cover hierarchical virtual memory, cache structures, parallel memories,
Figure 1.13 Functional design of an MIMD multiprocessor system. (The figure shows processors with local memories, shared-memory modules, a processor-memory interconnection network built from buses, a crossbar, or multiport memories, an interprocessor interrupt network, and input-output channels.)
Figure 1.14 Various estimates of the speedup of an n-processor system over a single processor.
the problem on an n-processor system is given below, where the summation repre-
sents the n operating modes:

T = Σ_{i=1}^{n} f_i · t_i = (1/n) Σ_{i=1}^{n} (1/i)    (1.7)

S = T(1)/T = n / Σ_{i=1}^{n} (1/i) ≈ n / ln n    (1.8)
For a given multiprocessor system with 2, 4, 8, or 16 processors, the respective
average speedups (using Eq. 1.8) are 1.33, 1.92, 3.08, and 6.93. The speedup obtained
in Eq. 1.8 can be approximated by n/ln n for large n. For example, S = 1000/ln 1000
= 144.72 for a system with n = 1000 processors. We have plotted the upper bound,
the lower bound, and the speedup using Eq. 1.8 in Figure 1.14.
The above analysis explains the reason why a typical commercial multi-
processor system consists of only two to four processors. Dr. John Worlton of the
United States Los Alamos Scientific Laboratory said once: “The designers of
supercomputers will do better at exploiting concurrency in the computing problems
if they use a small number of fast processors instead of a large number of slower
processors.” This conclusion coincides with the analytical prediction given in
Eq. 1.8.
To measure the real performance of a computer system, one cannot ignore the
computation cost and the ease in programming. Comparing multiprocessor
systems with other computer structures, we conclude the following: Pipelined
uniprocessor systems are still dominating the commercial market in both business
and scientific applications. Pipelined computers cost less and their operating
systems are well developed to achieve better resource utilization and higher
performance. Array processors are mostly custom designed. For specific applica-
tions, they might be effective. The performance/cost ratio of such special-purpose
machines might be low. Programming on an array processor is much more
difficult due to the rigid architecture. Multiprocessor systems are more flexible in
general-purpose applications. Pipelined multiprocessor systems represent state-
of-the-art design in parallel processing computers. Many of the computer manu-
facturers are taking this route in upgrading their existing systems.
Data flow computers The conventional von Neumann machines are called control
flow computers because instructions are executed sequentially as controlled by a
program counter. Sequential program execution is inherently slow. To exploit
maximal parallelism in a program, data flow computers were suggested in recent
years. The basic concept is to enable the execution of an instruction whenever its
required operands become available. Thus no program counters are needed in
data-driven computations. Instruction initiation depends on data availability,
independent of the physical location of an instruction in the program. In other
words, instructions in a program are not ordered. The execution follows the data
dependency constraints. Theoretically, maximal concurrency can be exploited in
such a data flow machine, constrained only by the hardware resource availability.
Programs for data-driven computations can be represented by data flow graphs.
An example data flow graph is given in Figure 1.15 for the calculation of the follow-
ing expression:
z = (x + y) * 2    (1.9)
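Data-driven execution can be sketched by firing an operation as soon as all of its input tokens have arrived, with no program counter. The sketch below evaluates the expression of Eq. (1.9) this way; the node and token representation is an illustrative model only.

```python
# Each node: (operation, input names, output name). Execution order is not fixed;
# a node fires whenever all of its inputs hold data tokens.
nodes = [
    ("add", ("x", "y"), "t"),      # t = x + y
    ("mul", ("t", "two"), "z"),    # z = t * 2
]
ops = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}

tokens = {"x": 3, "y": 5, "two": 2}          # initial data tokens
pending = list(nodes)
while pending:
    for node in list(pending):
        op, inputs, output = node
        if all(name in tokens for name in inputs):   # all operands available?
            tokens[output] = ops[op](*(tokens[n] for n in inputs))
            pending.remove(node)                     # the node has fired

print(tokens["z"])   # (3 + 5) * 2 = 16
```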
(Figure: the basic organization of a data flow computer, showing operation units, fetch and update units, and an activity store connected by message links.)
(Figure: Flynn's classification of computer organizations in terms of control units (CU), processor units (PU), memory modules (MM), shared memory (SM), instruction streams (IS), and data streams (DS).)
Computer class: computer system models (chapters where the system is quoted or described)

SISD (using one functional unit): IBM 701 (1); IBM 1620 (1); IBM 7090 (1); PDP VAX-11/780 (1).
SISD (with multiple functional units): IBM 360/91 (3); IBM 370/168 UP (1); CDC 6600 (1); CDC Star-100 (4); TI-ASC (4); FPS AP-120B (4); FPS-164 (4); IBM 3838 (4); Cray-1 (4); CDC Cyber-205 (4); Fujitsu VP-200 (4); CDC-NASF (4); Fujitsu FACOM-230/75 (4).
SIMD (word-slice processing): Illiac-IV (6); PEPE (1); BSP (6).
SIMD (bit-slice processing): STARAN (1); MPP (6); DAP (1).
MIMD (loosely coupled): IBM 370/168 MP (9); Univac 1100/80 (9); Tandem/16 (9); IBM 3081/3084 (9); C.m* (9).
MIMD (tightly coupled): Burroughs D-825 (9); C.mmp (9); Cray-2 (9); S-1 (9); Cray X-MP (9); Denelcor HEP (9).
μ = (Σ_{i=1}^{T} P_i) / (T · P)    (1.11)

If the computing power of the processor is fully utilized (or the parallelism is fully
exploited), then we have P_i = P for all i and μ = 1 for 100 percent utilization. The
utilization rate depends on the application program being executed.
Figure 1.17 demonstrates the classification of computers by their maximum
parallelism degrees. The horizontal axis shows the word length n. The vertical axis
corresponds to the bit-slice length m. Both length measures are in terms of the
number of bits contained in a word or in a bit slice. A bit slice is a string of bits, one
from each of the words at the same vertical bit position. For example, the TI-ASC
has a word length of 64 and four arithmetic pipelines. Each pipe has eight pipeline
stages. Thus there are 8 x 4 = 32 bits per each bit slice in the four pipes. TI-ASC
is represented as (64, 32). The maximum parallelism degree P(C) of a given com-
puter system C is represented by the product of the word length n and the bit-slice
length m; that is,

P(C) = n · m    (1.12)
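Since the maximum parallelism degree is simply the product of the two lengths, the points of Figure 1.17 can be tabulated directly. A small sketch using the (n, m) pairs quoted in the text and figure:

```python
# (word length n, bit-slice length m) for some of the systems in Figure 1.17
systems = {
    "TI-ASC":    (64, 32),
    "Illiac IV": (64, 64),
    "STARAN":    (1, 256),
    "PEPE":      (32, 288),
    "MPP":       (1, 16384),
    "C.mmp":     (16, 16),
}

for name, (n, m) in systems.items():
    print(f"P({name}) = {n} x {m} = {n * m} bits")   # maximum parallelism degree
```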
(Plotted points include MPP at (1, 16384), PEPE at (32, 288), STARAN at (1, 256), Illiac IV at (64, 64), and C.mmp at (16, 16); the horizontal axis is the word length n and the vertical axis the bit-slice length m.)
Figure 1.17 Feng's classification of computer systems in terms of parallelism exhibited by word length
and bit-slice length.
There are four types of processing methods that can be seen from this diagram:
computers. WPBS (n = 1, m > 1) has been called bit-slice processing because
an m-bit slice is processed at a time. WSBP (n > 1, m = 1), as found in most existing
computers, has been called word-slice processing because one word of n bits is
processed at a time. Finally, WPBP (n > 1, m > 1) is known as fully parallel pro-
cessing (or simply parallel processing, if no confusion exists), in which an array of
n × m bits is processed at one time, the fastest processing mode of the four. In
Table 1.4, we have listed a number of computer systems under each processing
mode. The system parameters n, m are also shown for each system. The bit-slice
processors, like STARAN, MPP, and DAP, all have long bit slices. Illiac-IV and
PEPE are two word-slice array processors. Some of these systems will be de-
scribed in later chapters.
The functions of PCU and ALU should be clear to us. Each PCU corresponds to
one processor or one CPU. The ALU is equivalent to the processing element (PE)
we specified for SIMD array processors. The BLC corresponds to the combina-
tional logic circuitry needed to perform 1-bit operations in the ALU.
A computer system C can be characterized by a triple containing six inde-
pendent entities, as defined below:
T(C) = <K × K', D × D', W × W'>    (1.13)

where K  = the number of processors (PCUs) within the computer
      K' = the number of PCUs that can be pipelined
      D  = the number of ALUs (or PEs) under the control of one PCU
      D' = the number of ALUs that can be pipelined
      W  = the word length of an ALU or of a PE
      W' = the number of pipeline stages in an ALU or in a PE
T(C.mmp) = <16, 1, 16> + <1 × 16, 1, 16> + <1, 16, 16>    (1.16)
(The figure shows 16 PDP-11 processors connected through a crossbar switch to shared memories, shown operating with various combinations of instruction streams (IS) and data streams (DS).)
Figure 1.18 Operation modes in C.mmp system (all double-arrowed paths are for both IS and DS).
T(TI-ASC)     = <1, 4, 64 × 8>
T(CDC-6600)   = <1, 1 × 10, 60> × <10, 1, 12>   (central processor × I/O processors)
T(Illiac IV)  = <1, 64, 64>
T(MPP)        = <1, 16384, 1>
T(C.mmp)      = <16, 1, 16> + <1 × 16, 1, 16> + <1, 16, 16>
T(PEPE)       = <1 × 3, 288, 32>
T(IBM 360/91) = <1, 3, 64 × (3 ~ 5)>
T(Prime)      = <5, 1, 16>
T(Cray-1)     = <1, 12 × 8, 64 × (1 ~ 14)>
T(AP-120B)    = <1, 2, 38 × (2 ~ 3)>
Fast and efficient computers are in high demand in many scientific, engineering,
energy resource, medical, military, artificial intelligence, and basic research areas.
Large-scale computations are often performed in these application areas. Parallel
processing computers are needed to meet these demands. In this section, we
introduce some representative applications of high-performance computers.
Without using superpower computers, many of these challenges to advance
human civilization could hardly be realized. To design a cost-effective super-
computer, or to better utilize an existing parallel processing system, one must
first identify the computational needs of important applications. With rapidly
changing application trends, we introduce only the major computations and leave
the readers to identify their own computational needs in solving each specific
problem.
Large-scale scientific problem solving involves three interactive disciplines:
theories, experiments, and computations, as shown in Figure 1.19. Theoretical
scientists develop mathematical models that computer engineers solve numerically;
the numerical results may then suggest new theories. Experimental science provides
(The figure shows three interacting groups: experimental scientists (physicists, engineers, chemists, biologists), theoretical scientists (mathematicians, physicists, chemists, logicians), and computational scientists (computer scientists, digital engineers, computational physicists). Experiments suggest and test theory and generate data; theory provides equations and interprets results; computation supplies accurate, large-scale calculations and suggests theory.)
Figure 1.19 Interaction among experiments, theories, and computations to solve large-scale scientific
problems (Courtesy of Rodrigue et al., IEEE Computer, 1980).
data for computational science, and the latter can model processes that are hard
to approach in the laboratory. Using computer simulations has several advantages:
1. Computer simulations are far cheaper and faster than physical experiments.
2. Computers can solve a much wider range of problems than specific laboratory
equipment can.
3. Computational approaches are only limited by computer speed and memory
capacity, while physical experiments have many practical constraints.
A. Numerical weather forecasting Weather and climate researchers will never run
out of their need for faster computers. Weather modeling is necessary for short-
range forecasts and for long-range hazard predictions, such as flood, drought, and
environmental pollution. The weather analyst needs to solve general circulation
model equations with the computer. The atmospheric state is represented by the
surface pressure, the wind field, temperature, and the water vapor mixing ratio.
These state variables are governed by the Navier-Stokes fluid dynamics equations
in a spherical coordinate system.
The computation is carried out on a three-dimensional grid that partitions the
atmosphere vertically into K levels and horizontally into M intervals of longitude
and N intervals of latitude (Figure 1.20). A fourth dimension is added as the
number P of time steps used in the simulation. Using a grid with 270 miles on a
side, a 24-hour forecast would need to perform about 100 billion data operations.
This forecast could be done on a 100 megaflops computer in about 100 minutes.
Figure 1.20 The general circulation model for three-dimensional global atmosphere simulation used in
numerical weather forecasting and climate studies.
This 270-mile grid gives the forecast between New York and Washington, D.C.,
but not for Philadelphia, about halfway between.
Increasing the accuracy of the forecast by halving the grid size in all four dimensions would
make the computation take at least 16 times longer. The 100 megaflops machine, like a
Cray-1, would therefore take 24 hours to complete the 24-hour forecast. In other
words, to halve the grid size, giving the Philadelphia weather, requires a computer 16
times more powerful (1.6 gigaflops) to finish the forecast in 100 minutes. Reliable
long-range forecasts require an even finer grid for a lot more time steps, and thus
demand a much more powerful computer than the 1.6 gigaflops machine.
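The 16-fold figure follows from halving the grid spacing in all four dimensions (three spatial dimensions plus the time steps). A small sketch of that arithmetic, using the machine rate quoted in the text:

```python
refinement = 2          # grid spacing halved
dimensions = 4          # longitude, latitude, altitude, and time steps
work_factor = refinement ** dimensions
print(work_factor)      # 16 times more data operations

base_rate = 100e6       # the 100-megaflops machine cited in the text
needed_rate = base_rate * work_factor
print(f"{needed_rate / 1e9:.1f} gigaflops")   # 1.6 gigaflops to keep the same running time
```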
B. Oceanography and astrophysics Since oceans can store and transfer heat and
exchange it with the atmosphere, a good understanding of the oceans would help
in the following areas:
Oceanographic studies use a grid size on a smaller scale and a time variability on a
larger scale than those used for atmospheric studies. To do a complete simulation
of the Pacific Ocean with adequate resolution (1° grid) for 50 years would take
1000 hours on a Cyber-205 computer.
The formation of the earth from planetesimals in the solar system can be
simulated with a high-speed computer. The dynamic range of astrophysical studies
may be from billions of years to milliseconds. Interesting problems include the
physics of supernovae and the dynamics of galaxies. Three-dimensional, n-body
integrations are run in such a study, involving 10^5 particles moving self-consistently
under Newtonian forces. The Illiac-IV array processor was used in this study.
listed factors. In contrast, computer flow simulations have none of these physical
constraints, but have their own: computational speed and memory capacity.
Two gigaflops supercomputers, known as the Numerical Aerodynamic Simu-
lation Facilities (NASF), have been proposed by the Burroughs Corporation and
by the Control Data Corporation. These are specialized “Navier-Stokes” machines,
capable of simulating complete aircraft design for both the U.S. government
and commercial aircraft companies. We will study the proposed designs, along
with their predecessor vector processors, in Chapters 4 and 6.
• Image processing
• Pattern recognition
• Computer vision
• Speech understanding
• Machine inference
• CAD/CAM/CAI/OA
• Intelligent robotics
• Expert computer systems
• Knowledge engineering
(The figure shows a processing chain: sensors with on-board preprocessing, data storage, preprocessing using geographic reference and calibration data, data analysis with human participation and ancillary data, and finally information consumption.)
Figure 1.21 Computer analysis of remotely sensed earth resource data (Courtesy of Swain, McGraw-
Hill International, 1978).
A. Seismic exploration Many oil companies are investing in the use of attached
array processors or vector supercomputers for seismic data processing, which
accounts for about 10 percent of the oil finding costs. Seismic exploration sets off
a sonic wave by explosive or by jamming a heavy hydraulic ram into the ground
and vibrating it in a computer-controlled pattern. A few thousand phones scattered
about the spot are used to pick up the echoes. The echo data are used to draw
two-dimensional cross sections that display the geometrical underground strata.
Reconstruction techniques are being used to identify the types of strata that may
bear oil. Such seismic exploration may save the drilling of many dry holes.
A typical field record for the response of the earth to one sonic input has 3000
different time values, each at about 48 different locations. This produces about 2 to 5
million floating-point numbers per kilometer along a survey line. In 1979 alone,
about 10^15 bits of seismic data were processed. One geophysical company in Houston
has about 2 million magnetic reels of seismic data in inventory and 300,000 reels
awaiting processing. The demand for cost-effective computers for seismic signal
processing is increasing sharply.
C. Plasma fusion power Nuclear fusion researchers are pushing to use a computer
100 times more powerful than any existing one to model the plasma dynamics in
the proposed Tokamak fusion power generator. Magnetic fusion research pro-
grams are being aided by vector supercomputers at the Lawrence Livermore
National Laboratory and at Princeton’s Plasma Physics Laboratory. The potential
for magnetic fusion to provide an alternate source of energy has become closer
as a result of the cooperative effort of the experimental program with the com-
putational simulation program.
Synthetic nuclear fusion requires the heating of plasma to a temperature of
100 million degrees. This is a very costly effort. The high-temperature plasma,
consisting of positively charged ions and negatively charged electrons, must be
D. Nuclear reactor safety Nuclear reactor design and safety control can both be
aided by computer simulation studies. These studies attempt to provide for:
The importance lies in the above operations being done in real time. For
light-water reactor safety analysis, a TRAC code has been developed to simulate the non-
equilibrium, nonhomogeneous flow of high-temperature water and steam. Another
code, SIMMER-II, has been developed to analyze core melting in a fast breeder
reactor. Only supercomputers can make these calculations possible in real time.
E. Weapon research and defense So far, military research agencies have used the
majority of the existing supercomputers. In fact, the first Cray-1 was installed
at the Los Alamos Scientific Laboratories in 1976. By 1981, four upgraded Cray 1’s
had been acquired by Los Alamos. Listed below are several defense-related
military applications of supercomputers.
Parallel processing computers have been treated in parts of the books by Hayes
(1978), Kuck (1978), Stone (1980), and Baer (1980). Enslow (1974) and
Satyanarayanan (1980) devoted their books to multiprocessor systems. The
book by Hockney and Jesshope (1982) covers only pipeline and array processors.
Problems
1.1 Distinguish among computer terminologies in each of the following groups:
(a) Data processing, information processing, knowledge processing, and intelligence processing.
(b) Batch processing, multiprogramming, time sharing, and multiprocessing.
(c) Parallel processing at the job level, the task level, the interinstruction level, and the intra-
instruction level.
(d) Uniprocessor systems versus multiprocessor systems.
(e) Parallelism versus pipelining.
(f) Serial processing versus parallel processing.
(g) Control flow computers versus data flow computers.
1.2 Existing computer systems are classified in Tables 1.3, 1.4, and 1.5, based on the three architectural
specification schemes given in Section 1.4. The listing in each table is not complete. Enter the specifica-
tion of at least two additional computer systems under each architectural category of each of the three
tables. Use the same specification format for the existing entries in making the new entries.
1.3 The speedup of using n processors over the use of one processor in solving a computing problem
was analyzed in Section 1.3.4 under various assumptions, such as f_i = 1/n and t_i = 1/i for i = 1,
2, ..., n.
(a) Repeat the performance speedup analysis to derive a new speedup equation (similar to Eq. 1.8),
under the following new probability distribution of operating modes:

f_i = i / Σ_{j=1}^{n} j    for i = 1, 2, ..., n    (1.17)

(b) Repeat the analysis under the distribution

f_i = (n − i + 1) / Σ_{j=1}^{n} j    for i = 1, 2, ..., n    (1.18)
(c) The case in (a) favors the assignment of the computing task to a larger number of processors,
whereas the case in (b) favors the assignment to a smaller number of processors. The case presented in
Section 1.3.4 treats all possible task divisions equally. Plot the new speedup curves obtained in case (a)
and in case (b) along with the plots given in Figure 1.14. Can you find new upper bounds for the new
speedup curves? Derive the upper bound, if it exists.
1.4 Name three distinct characteristics that exist in the ith generation computers for i = 1, 2, 3, and
4 but not in the jth generation for j = 0, 1, 2, ..., i − 1, where the 0th generation corresponds to
pre-electronic computers.
1.5 Match each of the following computer systems to the phrase that best describes it.

___ Illiac-IV      (1) A cluster of microprocessors
___ TI-ASC         (2) A vector processor made in Japan
___ CDC-7600       (3) A supermini computer with virtual memory
___ IBM 360/91     (4) The first MIMD multiprocessor consisting of 16 PDP-11 minicomputers
___ AP-120B        (5) The first IBM computer using the thermal conduction modules
___ Cray-1         (6) A multiprocessing vector processor by Cray Research
___ B-5500         (7) A major computer project at IBM in the 1960s
___ PEPE           (8) An array processor with 64 PEs
___ Cyber-205      (9) A multifunction computer with multiprocessing in the I/O subsystem
___ C.mmp          (10) The first operational electronic digital computer
___ BSP            (11) An associative processor with 288 PEs
___ MPP            (12) A commercial multiprocessor with a packet-switched interconnection network
___ Cray X-MP      (13) The first IBM scientific processor with multiple functional units
___ HEP            (14) An attached array processor for minicomputers
___ VP-200         (15) A first-generation pipelined vector processor
___ ENIAC          (16) An array processor with 16384 PEs
___ Stretch        (17) A CDC vector processor enhanced from the STAR-100
___ C.m*           (18) One of the first stack computers
___ VAX 11/780     (19) An array processor with 16 PEs and shared memories
___ IBM 3081       (20) A vector processor with 12 pipes and large register files
1.6 You were briefed about 15 important applications of parallel processing computers in Section 1.5.
Choose the one of these application areas that interests you most for an in-depth study. Dig out more
information from the library or request the source information from any application site of super-
computers that you know of. Prepare a study report based on your readings and observations in the
chosen area of supercomputer applications.
1.7 In the following block of computations, a and b are two external inputs and z is the final output.
Two intermediate results are labelled x and y.
(a) Draw a data flow graph for this code block, where *, +, −, and / are arithmetic operators.
(b) Show a template implementation of the data flow graph in (a).
(c) Indicate the events that can be done in parallel in the execution of the above block of code.
*1.8 Describe at least four characteristics of MIMD multiprocessors that distinguish them from
multiple computer systems or computer networks.
1.9 Prove that a k-stage linear pipeline can be at most k times faster than a nonpipelined serial
processor.
1.10 Summarize all forms of parallelism that can be exploited at different processing levels of a com-
puter system, including both uniprocessor and multiprocessor approaches. Discuss hardware, firmware,
and software supports needed to achieve each form of parallelism. Indicate example computers that
have achieved various forms of parallelism.
CHAPTER
TWO
MEMORY AND INPUT-OUTPUT SUBSYSTEMS
Memory systems for parallel processor computers are described in this section.
We begin with the hierarchical memory structures and the concept of virtual
memory. Virtual memory concepts are discussed for paged systems, segmented
systems, and systems with paged segments.
fastest memory. The cache is used to capture the segments of information which are
most frequently referenced by the processor. Information transfer between the
processor and the cache is on a word basis. Cache memories will be discussed in
detail in Section 2.4. The next lower level of memory consists of modules M1,1 through
M1,4 and constitutes the main memory. The four modules are usually designed with
metal oxide semiconductor (MOS) or ferromagnetic (core) technology, and the
(The figure plots cost per byte against access time for bipolar cache, MOS main memory, and moving-head disks.)
Figure 2.1 Cost and access time relationship.
(Figure 2.2: a three-level hierarchy in which each processor has a local (cache) memory, a processor-memory interconnection network leads to the main memory modules, and channels connect to secondary memory such as fixed-head disks or drums. Levels 1 to 3 are characterized by access times t1, t2, t3, memory capacities s1, s2, s3, and costs per byte c1, c2, c3.)
unit of information transfer between the main memory and cache is a block of
contiguous information (typically 2 to 32 words). The primary memory may
be extended either with the so-called large core storage (LCS) or with extended
core storage (ECS), both of which are made of slower core memories. The average
access time of the primary memory and its extensions are on the order of 0.5 μs
and 5 μs, respectively.
There exists a technological gap between the primary and secondary memory,
as evidenced by the access time characteristics shown in Table 2.1. Average access
time of secondary memories is 1000 to 10,000 times slower than that of primary
memories. Electronic disks, such as CCDs and MBMs, have not proved cost-
effective in closing the technological gap and thus have had little impact in the
design of memory systems. Hence, as shown in Figure 2.2, the secondary memories
most often used are disks and drums.
The processor usually references an item in memory by providing the location
or address of that item. A memory hierarchy is usually organized so that the address
space in level i is a subset of that in level i + 1. This is true only of the information
content, however; address A in level i is not necessarily address A in level i + 1, but any
information in level i may also exist in level i + 1. However, some of the information
in level i may be more current than that in level i + 1.
This creates a data consistency or coherence problem between adjacent levels
because they have different copies of the same information. Usually level i + 1 is
eventually updated with the modified information from level i. The data consistency
problem may also exist between the local memories or caches when two cooperat-
ing processes, which are executing concurrently or on separate processors, interact
via one or more shared variables. One process may update the copy of a shared
variable in its local memory while the other process continues to access the
previous copy of the variable in its local memory. This situation may result in the
incorrect execution of the cooperating processes. In general, a memory hierarchy
encounters such a coherence problem as soon as one of its levels is split into several
independent units which are not equally accessible from faster levels or processors.
Solutions to data consistency problems are discussed in Sections 2.3, 2.5, and 7.3.
In modeling the performance of a hierarchical memory, it is often assumed that
the memory management policy is characterized by a success function or hit ratio
H, which is the probability of finding the requested information in the memory of
a given level. In general, H depends on the granularity of information transfer,
the capacity of memory at that level, the management strategy, and other factors.
However, for some classes of management policies, it has been found that H is
most sensitive to the memory size s. Hence the success function may be written as
H(s). The miss ratio or probability is then F(s) = 1 − H(s). Since copies of infor-
mation in level i are assumed to exist in levels greater than i, the probability of a hit
at level i, and of misses at the higher levels 1 to i − 1, is

h_i = H(s_i) − H(s_{i−1})    (2.1)

where h_i is the access frequency at level i and indicates the relative number of
successful accesses to level i. The missing-item fault frequency at level i is then

f_i = 1 − H(s_i)    (2.2)
In general, ¢, includes the wait time due to memory conflicts at level k and the delay
in the switching network between levels k — 1 and k. The degree of conflicts is
usually a function of the number of processors, the number of memory modules,
and the interconnection network between the processors and memory modules.
In most systems, a request for a word which is not in memory level i causes the
block of information which contains the requested word to be transferred from
level i + 1 to level i. When the block transfer to level i has been completed, the
requested word is accessed in the local memory.
The effective access time for each memory reference in the n-level memory
hierarchy is
T = Σ_{i=1}^{n} h_i · T_i    (2.3)

where T_i is the time needed to access a word residing at level i. Since an access to
level i is preceded by misses at levels 1 through i − 1,

T_i = Σ_{k=1}^{i} t_k    (2.4)

so that the effective access time can also be written as

T = Σ_{i=1}^{n} F(s_{i−1}) · t_i    (2.5)

with F(s_0) = 1.
If c(t_i) is the cost per byte of memory at level i, expressed as a function of its
average access time, the total cost of the memory system is

C = Σ_{i=1}^{n} c(t_i) · s_i    (2.6)
A typical memory-hierarchy design problem involves an optimization which
minimizes the effective hierarchy access time T subject to a given memory-system
cost C_0 and size constraints. That is, minimize T = Σ_{i=1}^{n} F(s_{i−1}) · t_i, subject to
the constraint C = Σ_{i=1}^{n} c(t_i) · s_i ≤ C_0, where s_i > 0 and t_i > 0, for i = 1, 2, ..., n.
In practice, the cost constraints should include the cost of the processor-memory
interconnection network.
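The hit-ratio, access-time, and cost relations above can be exercised on a hypothetical three-level hierarchy. The level parameters below (hit ratios, times, capacities, costs) are made-up illustrative numbers, not values from the text.

```python
# Hypothetical 3-level hierarchy: cache, main memory, disk.
t = [50e-9, 500e-9, 5e-3]      # access time t_i per level (seconds)
s = [64e3, 8e6, 1e9]           # capacity s_i per level (bytes)
c = [1e-2, 1e-4, 1e-7]         # cost per byte c(t_i) (dollars)
H = [0.95, 0.999, 1.0]         # cumulative success function H(s_i)

# Access frequencies h_i = H(s_i) - H(s_{i-1})  (Eq. 2.1), with H(s_0) = 0.
h = [H[0]] + [H[i] - H[i - 1] for i in range(1, 3)]

# Effective access time T = sum_i F(s_{i-1}) * t_i  (Eq. 2.5), with F(s_0) = 1.
F = [1.0] + [1.0 - H[i] for i in range(2)]
T = sum(F[i] * t[i] for i in range(3))

# Total memory-system cost C = sum_i c(t_i) * s_i  (Eq. 2.6).
C = sum(c[i] * s[i] for i in range(3))

print("h =", [round(x, 3) for x in h])
print(f"T = {T * 1e9:.1f} ns, C = ${C:,.0f}")
```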
In the memory types we have discussed so far, the contents of a memory loca-
tion are accessed by specifying the memory location or address of the item. In another
type of memory, associative memory, the data stored in the memory can be accessed
by specifying the contents or part of the contents. In this sense, associative memory
has also been known as content-addressable memory and parallel search memory.
The major advantage of associative memory over the RAM is its capability of
performing parallel search and comparison operations, which are needed in many
important applications, such as table lookup, information storage and retrieval
of rapidly changing databases, radar-signal tracking and processing, image
processing, and real-time artificial intelligence computations. The major disad-
vantage of associative memory is its much increased hardware cost. Currently,
associative memories are much more expensive than RAMs, even though both are
built with integrated circuitry. However, with the rapid advent of VLSI technology,
the price gap between these types of memories may be reduced in the future.
Associative memories and associative processors will be treated in Section 5.4.
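The defining operation of a content-addressable memory, comparing a search key against every stored word at once, can be mimicked in software. A minimal sketch; the word format and the masking convention are illustrative only.

```python
def associative_search(memory, key, mask):
    """Return the indices of all words whose bits match 'key' in the positions
    selected by 'mask' (a 1 bit in mask means 'compare this bit').
    In a real associative memory every word is compared in parallel."""
    return [i for i, word in enumerate(memory) if (word ^ key) & mask == 0]

memory = [0b10110010, 0b10010010, 0b00110011, 0b10110110]
# Find every word whose upper four bits are 1011, ignoring the lower four bits.
print(associative_search(memory, key=0b10110000, mask=0b11110000))   # [0, 3]
```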
time, a previous memory request would not have completed its access before the
arrival of the next request, thereby resulting in a delay.
In array processors, if the data elements of a vector reside in the same module,
there will be insignificant parallelism in computation because the elements cannot
be fetched simultaneously by all processors for the “lock-step” manipulation.
The high-order interleaving can be used without conflict problems in multipro-
cessors if the modules are partitioned according to disjoint or noninteracting
processes. In practice, however, processes interact and share instructions and
data in multiprocessor systems and will thereby encounter considerable conflicts
in a high-order interleaved memory subsystem. For the above reasons, low-order
interleaving is frequently used to reduce memory interference.
An advantage of high-order interleaving is that it provides better system
reliability, since a failed module affects only a localized area of the address space
and therefore provides graceful degradation in performance. The failed module
can be logically isolated from the system and the memory manager can be informed
so that no process address space is mapped into the failed module. A failure of
any single module in the second scheme will almost certainly be catastrophic to
the whole system. The second scheme, however, seems preferable if memory inter-
ference is the only basis of choice.
A compromise interleaving technique is to partition the module address
field into the two sections S_{m-r} and S_r, so that section S_r is the least significant
Figure 2.4 Parallel memory system with consecutive words in consecutive modules. (AB: address buffer; DB: data buffer.)
r bits of the memory address and section S_{m-r} is the high-order m - r bits of the
address. Notice that the module address is formed by the concatenation of sections
S_{m-r} and S_r. In this scheme, the addresses are interleaved among groups of 2^r
memory modules. This tends to reduce memory interference to a segment of shared
data. The memory system is expandable in blocks of 2^r modules; however, a single
module failure disables an entire block of 2^r modules. This scheme is appealing for
systems with a large number of memory modules if r is chosen to be very small.
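The address decomposition performed by the three schemes can be sketched as follows (Python). The word counts, module counts, and field widths are arbitrary; setting r equal to the full module-field width reproduces pure low-order interleaving, and r = 0 reproduces pure high-order interleaving.

    # Mapping an m-bit word address onto (module number, address within module)
    # for 2^p modules under the three interleaving schemes.
    def low_order(addr, p):
        return addr & ((1 << p) - 1), addr >> p

    def high_order(addr, p, m):
        return addr >> (m - p), addr & ((1 << (m - p)) - 1)

    def compromise(addr, p, m, r):
        # Module number = concatenation of (p - r) high-order bits and the
        # r least significant bits (sections S_{m-r} and S_r of the text).
        s_low  = addr & ((1 << r) - 1)
        s_high = (addr >> (m - (p - r))) & ((1 << (p - r)) - 1)
        module = (s_high << r) | s_low
        local  = (addr >> r) & ((1 << (m - p)) - 1)
        return module, local

    m, p, r = 16, 4, 2                            # 64K words, 16 modules
    for addr in (0, 1, 2, 3, 0x1234):
        print(addr, low_order(addr, p), high_order(addr, p, m),
              compromise(addr, p, m, r))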
2.2 VIRTUAL MEMORY SYSTEM
In many computer systems, programmers often realize that some of their large
programs cannot fit in main memory for execution. Even if there is enough main
memory for one program, the main memory may be shared with other users,
causing any one program to occupy some fraction of memory which may not
be sufficient for the program to execute. The usual solution is to introduce manage-
ment schemes that intelligently allocate portions of memory to users as necessary
for the efficient running of their programs. The use of virtual memory to achieve
this goal is described in this section.
V = {0, 1, ..., n - 1}

Assume that the memory space allocated to the program in execution has m
locations. This space can be represented as a sequence of addresses:

M = {0, 1, ..., m - 1}

since main memory can be regarded as a linear array of locations, where each
location is identified by a unique memory address. Also, since the allocated memory
space may vary with program execution, m is a function of time.
At any time t and for each referenced name x ∈ V, there is an address map
f_t: V → M ∪ {ϕ}, which gives the main memory location of x if x is currently
stored in the main memory, and ϕ otherwise.
Program locality The sequence of references made by the ith program in execution
can be represented by a reference string R_i(T) = r_i(1) r_i(2) ... r_i(T), where
r_i(t) ∈ V_i is the tth virtual address generated by process i. It is common knowledge
that the virtual addresses generated are nonrandom but behave in a somewhat
predictable manner. Such characteristics of programs are due to looping, se-
quential and block-formatted control structures inherent in the grouping of
instructions, and data in programs. These properties, referred to as the locality of
reference, describe the fact that over an interval of virtual time, the virtual addresses
generated by a typical program tend to be restricted to small sets of its name space,
as shown in Figure 2.5. For example, if one considers the interval A in Figure
2.5, the subset of pages referenced in that interval is much smaller than the set of
addressable pages.
There are three components of the locality of reference, which coexist in an
active process. These are temporal, spatial, and sequentiality localities. In temporal
locality, there is a tendency for a process to reference in the near future the elements
of the reference string referenced in the recent past. Program constructs which
lead to this concept are loops, temporary variables, or process stacks. In spatial
locality there is a tendency for a process to make references to a portion of the
virtual address space in the neighborhood of the last reference. The principle of
sequentiality states that if the last reference was r_i(t), then there is a likelihood that
the next reference is to the immediate successor of element r_i(t). Traversals of a
sequential set of instructions and arrays of data enforce spatial and sequentiality
localities. It should be noted that each process exhibits an individual character-
istic with respect to the three types of localities.
Each type of locality aids or influences the characterization of an efficient
memory hierarchy. The principle of spatial locality permits us to determine the
size of the block to be transferred between levels. The principle of temporal
(Figure 2.5: virtual page numbers referenced by a program as a function of time; the pages referenced during an interval A form a small subset of the name space.)
program. The former case is called static relocation; the latter is called dynamic
relocation. Static relocation makes it difficult for processes to share information
which is modifiable during execution. Furthermore, if a program is displaced from
main memory by mapping, it must be reloaded into the same set of memory
locations, thereby fixing or binding the physical address space of the program for
the duration of the execution. This constraint causes inefficient memory manage-
ment policies. Multiprogramming systems do not generally use static relocation
because of these and other disadvantages. In order to effectively utilize memory
resources, dynamic relocation is often used, in which the function f_t is varied
during the execution of the programs.
One technique in performing dynamic relocation is to use a set of base or
relocation registers in which the content of a relocation register is added to the
virtual address at each memory access. In this case, the programs may be initially
loaded into memory using static relocation, after which they may be displaced
within memory and the contents of the relocation register adjusted to reflect the
displacement. Two or more processes may share the programs by using different
relocation registers.
(Figure 2.6: virtual-to-real address translation through a page table located by the PTBR; each page table entry PTE(i) holds a fault bit F, access bits RWX, a modified bit, a private bit P, and the page-frame address PFA.)
random-access memory or register set to store the page table. For example, the
Xerox Sigma 7 processor has a set of 256 registers of nine bits each and a page size of
512 words. This corresponds to a virtual memory of 2^17 words (128K). A better
solution is to exploit the locality of reference property of programs and use an
associative map which consists of an N-entry translation lookaside buffer (TLB).
Hence the TLB may contain the N most recently accessed virtual page numbers
and their corresponding page-frame addresses.
In a system with a single virtual address space, all users reside in the same
virtual memory. Another method is to partition the virtual space into several
independent areas, allocating one to each active process. This can be accomplished
by using the high-order bits of the virtual page number as a process identification.
These bits with the PTBR can be used to select the page table of a process.
Yet another technique of maintaining multiple virtual address spaces is to
fix the virtual space and concatenate a system-generated process identification
with the virtual address. This is illustrated in Figure 2.7. For a multiprogrammed
processor, a page map entry typically consists of six fields: a virtual page number
i_v, a process identification, the RWX, a modified bit (C), a private bit (P), and the
PFA in shared memory. The process identification of the currently running process is in the
current process register (CPR) of the processor.
When a virtual address is generated by a running process, the virtual-to-real
address translation involves the associative comparison of the virtual page number
Figure 2.7 Virtual to real address translation using the page map. (PFA: page-frame address in shared memory; PID: process identification number; P = 1 indicates the page is private to a process; C = 1 indicates the page has been modified.)
i_v with all the page map entries (PME) that contain the same process identification
as the currently running process. If there is a match, the page-frame number is
retrieved and the physical address is formed by concatenating the displacement
with the PFA. If there is no match, a page fault interrupt occurs, which is serviced to
locate the page. Moreover, if the page-access key presented by the virtual address
does not match the RWX field of the PME with the corresponding virtual page
number and PID, an access violation is trapped. When a referenced page is
modified, the modified bit C of the corresponding PME is set in the page map. This
bit may be used by the replacement and memory update policies.
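A rough sketch of this associative comparison is given below (Python). The entry layout and field names are illustrative only and do not reproduce the format of any particular machine.

    # Associative TLB lookup with process identification and access checking.
    class AccessViolation(Exception): pass
    class PageFault(Exception): pass

    def tlb_translate(tlb, pid, vpage, offset, requested, page_size=1024):
        for entry in tlb:
            if entry["vpage"] == vpage and entry["pid"] == pid:
                if requested not in entry["rwx"]:
                    raise AccessViolation(f"{requested} denied on page {vpage}")
                if requested == "W":
                    entry["modified"] = True          # set the C bit on a write
                return entry["pfa"] * page_size + offset
        raise PageFault(vpage)                         # fall back to the page table

    tlb = [{"vpage": 3, "pid": 7, "rwx": "RX", "modified": False, "pfa": 42}]
    print(hex(tlb_translate(tlb, pid=7, vpage=3, offset=0x10, requested="R")))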
When a page fault occurs because the virtual page number i_v was not found in
the TLB, a dynamic address translation is requested, using the page table which is
resident in main memory. The virtual page number i_v, used as an index, is added
to the page table address in the PTBR and the resulting address is used to access
the PTE as described earlier. If the PTE indicates that the page is not in main
memory, the running process is blocked or suspended. A context switch is then
made to another ready-to-run process while the page is transferred from drum or
disk to memory and the PTE entry updated. The page address on disk or drum
may be found in the address field of the PTE. The context or task switch involves
the saving of the state of the faulting process and restoring the state of the runnable
process in the processor.
The TLB is invalidated or its contents are saved in memory as part of the
faulting process. The task switch is made because the page-transfer operation is
slow compared to the processor speed. If the page is in memory, the TLB is up-
dated with the virtual page number and the page’s page-frame address pair before
the process resumes execution. Updating the TLB involves replacing one of its
entries if it is full. The entry chosen for replacement is usually the least recently
used entry. Additional control bits, such as a set of usage bits, are associated with
each page map entry. The usage bits determine which entry is overwritten during
the replacement policy. Sometimes a private bit P is associated with each page to
indicate that the page is private to a process or shared by a set of processes.
Pure paged memory systems can become very inefficient if the virtual space is
large. The size of a page table can become unreasonably large. For example,
consider a system with a 32-bit virtual address and a 1024 (1K)-byte page size.
The page address field is thus 22 bits, assuming byte addressability. Hence, we
have 2^22 page table entries! Assuming that we have an 8M-byte main memory,
there are 2^23/2^10 = 2^13 page frames. Therefore, in the PTE we have a 13-bit
page-frame field, or approximately 4 bytes per PTE. The total space consumed by
a page table is thus 2^24 bytes! In such cases, the page table may have to be paged
also.
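The arithmetic of this example is easy to recheck; the short sketch below (Python) merely recomputes the figures quoted above.

    # Page-table-size example: 32-bit virtual address, 1K-byte pages,
    # 8M-byte main memory, and roughly 4 bytes per PTE.
    virtual_bits, page_bytes, memory_bytes = 32, 1 << 10, 8 << 20

    offset_bits = page_bytes.bit_length() - 1        # 10 bits of byte offset
    entries     = 2 ** (virtual_bits - offset_bits)  # 2^22 page-table entries
    frames      = memory_bytes // page_bytes         # 2^23 / 2^10 = 2^13 frames
    pte_bytes   = 4                                  # 13-bit frame field, rounded up
    table_bytes = entries * pte_bytes                # 2^24 bytes

    print(entries, frames, table_bytes, table_bytes == 2 ** 24)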
There are other disadvantages of a pure paged system. There are no mechan-
isms for a reasonable implementation of sharing. The size of a program space is
not always an integral number of pages; hence, oftentimes, internal fragmentation
occurs in memory because the last part of the last page is wasted. In addition, there
is another type of storage fragmentation called table fragmentation, which occurs
because some of the physical memory is occupied by the page tables and is therefore
unavailable for assignment to virtual pages. The VAX 11/780 virtual memory
system is described below as an example of a paged memory system.
The virtual address of the VAX 11/780 is 32 bits wide and the page size is
2^9 = 512 bytes. For each reference, this address is translated, via a page map, to a
physical address that is 30 bits wide. The entry format of the page table is shown in
Figure 2.8. Bit 31 of the PTE represents the valid bit which, when set, indicates
that the referenced page is in main memory. Therefore, bits <20:0> of the PTE
contain the physical page-frame number of the page. If the valid bit is reset, bits
<20:0> of the PTE do not contain a valid memory address for the referenced page. Thus
a page fault occurs and bits <25:0> of the PTE are used to determine the location
of the page on disk.
The modified bit (bit 26) of the PTE, if set, indicates that the page was modified.
Hence, the disk copy of the page must be updated when the page frame is de-
allocated. The modified bit is set on the first write reference to the page. Bits <30:27>
of the PTE contain the protection mask or access privileges permitted on that
page. The protection mask is defined for four process types: kernel, executive,
supervisor, and user processes. In a memory reference, the requested access type
for the process is compared to the allowable accesses, if any. Access is denied if an
unpermitted access was requested.
Figure 2.8 Address and page table entry formats of VAX-11/780 virtual memory (Courtesy of Digital Equipment Corp.). (Virtual address: virtual page number in bits <31:9>, byte offset in bits <8:0>. Physical address: page frame number in bits <29:9>, byte offset in bits <8:0>. PTE: valid bit, protection mask for kernel, executive, supervisor, and user, modified bit, and page frame number.)
WWW.Gitmgurgaon.blogspot.com
Figure 2.9 Partitions of virtual address space (Courtesy of Digital Equipment Corp.). (Bits <31:30> = 00 select the P0 program region, 01 the P1 control region, 10 the system region, and 11 an unused region; the P0 region holds the user program, and the P1 region holds the user, supervisor, executive, and kernel stacks together with other process-specific code and data.)
For further memory protection, the virtual address is partitioned into two
spaces, process and system. Each of the two spaces is further partitioned into
two regions. The process space consists of program (P0) and control (P1) regions,
as shown in Figure 2.9. These regions permit two directions of growth. The system
space consists of a system region and an unused region. Bits <31:30> of the virtual address
are used to specify the addressed region. A page table is established for each region.
Each user process is assigned its own process space and, therefore, page tables for
its private program and control regions. However, all user processes share the
same system space.
Since all users share the same system space, there is only one page table for
the system space. This page table is called the system page table (SPT). The SPT
is described by two hardware registers: the system base register (SBR) and the
system length register (SLR). The SBR contains the starting physical address of
the SPT, which must be contiguous and cannot be paged.
Similarly, two hardware registers are allocated to each of the program and
control regions' page tables of the user process. These registers are P0BR and
P0LR for the program region's page table, and P1BR and P1LR for the control
region's page table, as shown in Figure 2.10. These registers are always loaded
with the address and length of the page tables for the process in execution.
The process page tables are stored in the contiguous system space's virtual
memory; therefore, the page tables' base registers contain system space addresses
so that the process-space page tables can be paged. An address reference in the
process space requires a two-level address translation. The address translation
process is illustrated in Figure 2.11. To speed up the translation process, an associa-
tive page map (address translation buffer) is provided. It has 128 entries divided
into two 64-entry groups for process and system spaces. On a context switch, only
the process space entries are purged.
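A rough sketch of the two-level translation for a process-space (P0) reference is given below (Python). It follows the description above under heavy simplifications: the system page table and physical memory are modelled as small dictionaries, the P0 base register value and all table contents are invented, and protection checks and the translation buffer are omitted.

    # Simplified VAX-style two-level translation of a P0 (process-space) address.
    PAGE = 512                                     # bytes per page

    def translate_p0(vaddr, p0_base, spt, memory):
        vpn, offset = (vaddr & 0x3FFFFFFF) // PAGE, vaddr % PAGE
        pte_sys_vaddr = p0_base + 4 * vpn          # system virtual address of the PTE
        sys_vpn = (pte_sys_vaddr & 0x3FFFFFFF) // PAGE
        valid, sys_pfn = spt[sys_vpn]              # one-level lookup via the SPT
        if not valid:
            raise RuntimeError("page fault on the process page table itself")
        pte_paddr = sys_pfn * PAGE + pte_sys_vaddr % PAGE
        valid, pfn = memory[pte_paddr]             # fetch the process PTE
        if not valid:
            raise RuntimeError("page fault on the referenced page")
        return pfn * PAGE + offset

    # Tiny worked example with invented register and table contents.
    p0_base = 0x80300000                           # system-space address from P0BR
    spt     = {0x300000 // PAGE: (True, 5)}        # SPT entry covering the P0 table
    memory  = {5 * PAGE: (True, 9)}                # process PTE for virtual page 0
    print(translate_p0(0x0000007F, p0_base, spt, memory))   # frame 9, offset 0x7F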
2.2.3 Segmented Memory System
Programs which are block-structured and written in languages such as Pascal, C,
and Algol yield a high degree of modularity. These modules may include procedures
or subroutines which call other procedures. The modules are compiled to produce
machine codes in a logical space which may be loaded, linked, and executed. The
set of logically related contiguous data elements which are produced is commonly
called a segment, which is given a segment name. Segments are allowed to grow and
shrink almost arbitrarily, unlike pages, which have fixed sizes. Segmentation is a
WWW.Gitmgurgaon.blogspot.com
72 COMPUTER ARCHITECTURE AND PARALLEL PROCESSING
technique for managing virtual space allocation, whereas paging is a concept used
to manage the physical space allocation. In a segmented system, a user can define
a very large logical space, which can be managed efficiently. An element in a seg-
ment is referenced by the segment name-element name pair (<s>, [i]). During
program execution, the segment name <s> is translated into a segment address
by the operating system. The element name maps into a relative address or dis-
placement within the segment during program compilation.
A program consists of a set of linked segments where the links are created as a
result of procedure segment calls within the program segment. The method of
mapping a segmented virtual address into a physical address uses a segment table
(ST) for each process, located through a segment table base register (STBR).
(Figure 2.11 illustrates the VAX-11/780 address translation. For a process-space reference, the system virtual address of the process page table entry is formed by adding 4 x VPN to P0BR or P1BR; that address is itself translated through the system page table, via the SBR, to fetch the system PTE, from which the physical address of the process PTE is formed and the process PTE fetched. A page fault can occur at either of these memory accesses. A system-space reference is translated in a single step through the system page table.)
(Figure 2.12: the segment table (ST) holds one entry STE(s) per segment; each entry contains the segment's base address, access bits RWX, length L, and fault bit F, where F = 1 indicates a segment fault.)
Each segment table entry STE(s) contains the base address, the length, and the access
attributes of the segment. Figure 2.12 illustrates how the physical
address is determined from the segmented virtual address, which consists of the
segment number s and the index i of the word within the segment. Segments
may be shared by several processes, as shown in Figure 2.13 for two processes in
separate processors. Notice that the relative positions of a shared segment need
not be the same in different segment tables.
When a segment <s> is initially referenced in a process, its segment number s
is not established. The segment must therefore be made known to the process by
providing a corresponding segment number as an entry in the ST to be used in
subsequent references. Using the segment name (directory pathname) <s> as a
key, a global table, called the active segment table (AST), which is shared by all
processes, is searched to determine whether the segment is active in memory. If it is,
the absolute base address of the segment and its attributes are returned and an
entry is made in the AST to indicate that the process is using this segment. If <s>
does not exist in the AST, a file directory search is initiated to retrieve the segment and
its attributes. The returned absolute base address of the segment and its attributes
are entered into the AST and a newly created node of the ST. A segment number s,
which is the displacement of the node, is assigned by the operating system from the
set of unused segment numbers for that process.
Associated with each process is a known segment table (KST), which contains
entries on a set of segments known to the process. Each entry in the table contains
a segment name-segment number pair. This is used to obtain the segment number
when subsequent references are made to the segment name in the process. The
address mapping mechanism shown for the segmented system involves a method
of indirection to access each word that is referenced. This inefficiency may be re-
solved by the use of associative mapping techniques, as discussed in the paged
system.
When a segment is copied from disk to memory, it is moved in its entirety.
This is also true when the segment is relocated in memory. An appropriate size of
contiguous data area must be found and allocated to that segment before the trans-
fer operation is initiated. It is not often that a contiguous block of memory is
found to fit the segment. In many cases, there are unused fragments of space, called
holes, each of which may not always fit the segment to be placed. Various placement
algorithms have been proposed. We present four of them.
Let s_1, s_2, ..., s_n be the sizes of the n holes available in memory, and let s be
the size of the segment to be placed. If the holes are listed in order of increasing size,
s_1 ≤ s_2 ≤ ... ≤ s_n, then the best fit algorithm finds the smallest i such that s ≤ s_i.
Similarly, the worst fit algorithm can be defined if the holes are listed in order of
decreasing size. This algorithm places the segment in the first hole and links the
hole formed by the remaining space into the appropriate position in the list. In a
third algorithm, called the first fit, the hole table lists holes in order of increasing
initial address. The hole with the smallest i such that s ≤ s_i is selected.
The fourth algorithm is the buddy system. In this case we assume that the seg-
ment size is s = 2^k for some k ≤ n. This policy maintains n hole lists, one for each
hole size 2^1, 2^2, ..., 2^n. A hole may be removed from the (i + 1)th list by splitting
it into halves, thereby creating a pair of "buddies" of size 2^i, which are entered in the
ith list. Conversely, a pair of buddies may be removed from the ith list, coalesced, and
the new hole entered in the (i + 1)th list. With this scheme, we can develop an
algorithm to find a hole of size 2^k.
The best fit algorithm appears to minimize the wastage in each hole it selects,
since it selects the smallest hole that will fit the segment to be placed. However, the
worst fit algorithm is based on the philosophy that the allocation of a larger hole
will probably leave a hole large enough to be useful in the near future. It also
assumes that making an allocation from a small hole will leave an even smaller
hole, which will probably be useless without coalescing with other holes. The
first fit and the buddy system are the most efficient algorithms.
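Under the definitions above, the three list-based placement algorithms can be sketched as follows (Python); holes are represented as (address, size) pairs, and for brevity the code only selects a hole without updating the hole list afterwards.

    # Best-fit, worst-fit and first-fit selection of a hole for a segment of size s.
    def best_fit(holes, s):
        fits = [h for h in holes if h[1] >= s]
        return min(fits, key=lambda h: h[1]) if fits else None

    def worst_fit(holes, s):
        fits = [h for h in holes if h[1] >= s]
        return max(fits, key=lambda h: h[1]) if fits else None

    def first_fit(holes, s):
        for hole in sorted(holes):                 # ordered by increasing address
            if hole[1] >= s:
                return hole
        return None

    holes = [(0, 300), (1000, 120), (2000, 800), (4000, 250)]
    for fit in (best_fit, worst_fit, first_fit):
        print(fit.__name__, fit(holes, 200))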
In most cases, a time-consuming memory compaction is used to collect frag-
ments of unused space into one contiguous block for the appropriate segment
size. Moreover, since in the process of compaction segments in use are moved,
the corresponding segment table entries must be modified. The unoccupied holes
of various sizes which tend to appear between successive segments give rise to a
phenomenon called external fragmentation. This causes memory management
inefficiencies. Moreover, a whole segment may be brought into memory when only
a small fraction of its address space will be referenced during the lifetime of the
process, resulting in superfluity. These problems can be alleviated by combining
segmentation with paging. It should be noted that table fragmentation also occurs
in segmented systems.
(Figure: address translation for paged segments. The virtual address is the triple (s, i_p, i_w); STE(s) locates the page table of segment s, PTE(i_p) supplies the page-frame address, and i_w selects the word WORD(s, i_p, i_w) within the page.)
Figure 2.15 Associative map for paged segments.
in the TLB is chosen for replacement with the new segment number, page number,
and page address triple.
If the segment number is not known when using segmented name space, a
segment fault occurs, which invokes a procedure to make the segment known and
causes the processor to perform a context switch to another process. The segment
is made known by searching for the segment pathname in the AST. If it exists, the
segment is in memory and is being used by an active process. Hence, its PT location
in shared memory is known and is obtained from the AST entry. An unused
segment number for the faulting process is obtained and an entry is made in the
process control block of the process to prepare the process for subsequent execution.
However, if the segment is not known to any active process, a directory search is
performed to find the location of the segment in the file memory. A free PT and
an unused AST entry are obtained. These segment attributes are copied to the
PT and a pointer is established in the AST entry to point to the PT. A page of
the segment is then copied to the memory and the appropriate entries are made
in the ST of the process, as described previously.
In a virtual memory system, a page-fault interrupt typically violates the
assumption made about interrupts on a processor. While interrupts occur asyn-
chronously, they are constrained to be serviced at the end of an instruction cycle.
However, a page fault interrupt which occurs within an instruction cycle must be
serviced before the instruction cycle can be completed. This problem occurs, for
example, when an instruction encoding crosses a page boundary or when a reference
is made to an operand which is outside the page. Hence, at the point of inter-
rupt, the page-fault handler must determine how far the instruction has progressed
and what it must do to restart or continue the instruction cycle.
In some systems, many instructions can be restarted simply by backing up the
program counter and reexecuting the instruction from the beginning. However,
the partial execution of other instructions may have already made irrevocable
changes to the registers and memory states. Such instructions must be restarted
from the point of interruption. In general, this requires the saving of many “atomic”
processor states, such as machine cycles, or the prohibition of any instructions
which cannot be “backed out” of.
There are certain problems involved with using an associative map (TLB) in
a multiprogrammed processor. The size of a TLB is fixed and hence can contain
only a limited number of entries. If a segment number-page number pair <s, i_p>
does not exist in the TLB at the time of reference, it is accessed from the PT of the
segment in memory and is used to replace an entry in the TLB. When a page fault
or segment fault occurs because the page or segment is not in memory, the processor
suspends the faulting process and switches to a ready-to-run process.
The new process has its own address space, which is different from that of the
suspended process. Therefore, all the entries in the TLB map become invalid. The
mapping mechanism must ensure that no old TLB entry is used in the new address
space, as this may result in incorrect access to physical words of the suspended
process and thereby create a hole in the protection mechanism. This problem can
be solved by the context switch mechanism, which can invalidate all entries of the
MEMORY AND INPUT-OUTPUT SUBSYSTEMS 79
TLB by the use of a special instruction, as was done in the original GE-645 Multics
system. This technique degrades the performance of the system since the new pro-
cess goes through initial slow indirect accesses to retrieve the <s, i_p> entries from
the STs and PTs of the process.
Further problems exist when all TLB entries are invalidated. As the new pro-
cess slowly fills up the TLB map with the valid entries, it may be interrupted or
page faulted again, which will cause the TLB entries to be invalidated once more.
Processes may continuously undergo the TLB reload cost and severely degrade
the system performance. This problem can be solved by introducing in each TLB
entry a process identification field which contains a short encoding of the process
identification number. This technique, as implemented in the IBM 370/168,
permits the associative map to contain more than one process entry (address
space). However, only the entry that matches the currently running process is
used. A process may therefore be restarted with part of its mapping entries in the
TLB, thereby reducing the reload cost.
Choice of page size In purely segmented memory systems, we found that external
fragmentation is a potential cause of memory under-utilization. The external
fragmentation can theoretically be avoided by paging. However, paged segments
reduce the utilization of memory by using additional storage space for segment and
page tables (table fragmentation) and by rounding up the memory requirements
for a segment to an integral number of pages (internal fragmentation). If z words
is the size of a page and s is the segment size in words, the number of pages allocated
to the segment is

n(s, z) = ⌈s/z⌉

Assume further that each page-table entry occupies c words, where c is a constant.
The fraction w of memory wasted because of paging in a segmented system is

w = [(c + z) n(s, z) - s] / s    (2.7)

The expectation of the numerator of w is (c + z)E[n(s, z)] - E[s] and that of the
denominator is E[s]. If we denote the ratio of the expectations by ŵ and approximate
E[n(s, z)] by S/z + 1/2, where S = E[s], we have

ŵ = c/z + z/(2S) + c/(2S)
By setting dŵ/dz = 0, we find that the optimum page size z_0 and the minimum
fractional wasted space w_0 are z_0 = √(2cS) and w_0 = √(2c/S) + c/(2S). In general, the
fractional wasted space decreases when the segments (and pages) increase in size.
This is in contrast to the requirements for contiguous segments, which should be
small in size to reduce external fragmentation.
The choice of the page size z is a critical parameter which affects the perform-
ance of a virtual memory system. Assuming that S = 8192 bytes and c = 4,
z_0 = 256 bytes. This seems rather small when it is known that typical values of
z are 256 to 2048 bytes. In practice, the choice of z depends mostly on the efficiency
of the paging device.
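The quoted figures follow directly from the expressions for z_0 and w_0; a short numerical check (Python) with the stated values:

    # Optimum page size z0 = sqrt(2cS) and minimum wasted fraction
    # w0 = sqrt(2c/S) + c/(2S), with S = 8192 and c = 4 as in the text.
    from math import sqrt

    S, c = 8192, 4
    z0 = sqrt(2 * c * S)                 # 256.0
    w0 = sqrt(2 * c / S) + c / (2 * S)   # about 0.0315
    print(z0, round(w0, 4))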
In this section, we discuss the various models and classification of memory manage-
ment schemes. Basically, two policies, fixed and variable partitioning, are identified
to manage the allocation of memory pages to active processes. In the fixed alloca-
tion scheme, the partition of memory allocated to an active process is fixed during
the lifetime of the process. The variable allocation scheme permits the partition to
vary dynamically during the lifetime of the process and according to the memory
requirements of the active process. Various paging algorithms are discussed for
both the fixed and variable partitioning policies.
borrowed from their work. Let us denote by A = {P_1, P_2, ..., P_d} the set of
active processes during the interval in which the level of multiprogramming is
fixed [d = d(t)]. To each P_i at time t is associated its resident set Z_i(t) (which is the
set of the page frames of the process present in memory) containing z_i(t) ≥ 1 pages.
In general, the resident sets Z_i(t) overlap because of the sharing that takes place
among active processes. The management configuration is represented by a parti-
tion vector Z(t) = [Z_1(t), ..., Z_d(t)]; hence, the size vector is z(t) = [z_1(t), ..., z_d(t)].
The total set of page frames used by the d processes is

Z_1(t) ∪ Z_2(t) ∪ ... ∪ Z_d(t)
Let z_{ij}(t) represent the number of pages shared by processes P_i and P_j, P_i ≠ P_j,
and let z_{ijk}(t) represent the number of pages shared by processes i, j, and
k, at time t. That is, ignoring the t's, we write

z_i = n(Z_i),   z_{ij} = n(Z_i ∩ Z_j),   z_{ijk} = n(Z_i ∩ Z_j ∩ Z_k), ...

where n(Z) is the number of pages in a set Z. The sum of all z(t)'s with r subscripts
represents the total number of pages shared by r processes at time t. We will denote
this by N_r(t). Hence, N_1(t) = Σ z_i(t), N_2(t) = Σ z_{ij}(t), N_3(t) = Σ z_{ijk}(t), where
1 ≤ i < j < k < ... ≤ d. Note that N_r(t) has (d choose r) terms and the last sum, N_d(t),
reduces to a single term that indicates the number of pages shared by all the d
processes. If M is the total number of page frames available for allocation in
memory, then it follows from the inclusion-exclusion principle (Feller, 1970) that

Σ_{r=1}^{d} (-1)^{r+1} N_r(t) ≤ M

at every time instant t. The pages of main memory which are unused by any active
process constitute the resource memory, denoted by

R(t) = M - Σ_{r=1}^{d} (-1)^{r+1} N_r(t)    (2.10)
since partition changes occur as infrequently as possible; that is, when the set of
active processes changes. This advantage can be very easily offset (even if the mem-
ory requirements of each process can be predicted prior to processing) when one
accounts for the changing locality in a process. Consider the behavior of a fixed
partition when each process of the set of active processes {P_1, ..., P_d} has a large
variance in locality set size as the time varies.
Since the partition is fixed, there is no way to reallocate page frames from
Z_i to Z_j at a time when P_i's locality is smaller than z_i and P_j's locality is larger than
z_j, even though such a reallocation would not degrade the performance of P_i but
would improve the performance of P_j. This effect has been analyzed by comparing
fixed versus variable memory-partitioning strategies in terms of the probability
that the memory space a process demands exceeds the allocated space. A study
suggests that the variable-partitioning strategy is much better than the fixed-
partitioning strategy because there is a severe loss of memory utilization for
processes that exhibit a wide variance of locality size.
In addition to the fixed- and variable-partitioning strategies, a memory
policy can either be global or local. A local policy involves only the resident set
of the faulting process; the global policy considers the history of the resident sets
of all active processes in making a decision.
We describe the behavior of programs being executed in terms of certain
parameters which define various memory management policies for fixed- and
variable-partitioning strategies. Recall that a program in execution generates a
sequence of references (known as an address trace) to information in its virtual
address space. The ith process’s behavior is described in terms of its reference
string, which is a sequence:
R_i(T) = r_i(1) r_i(2) ... r_i(T)

in which r_i(k) is the number of the page containing the virtual addresses referenced
by the process P_i at time k, where k = 1, 2, ..., T measures the execution time or
virtual time. The set of pages that P_i has in main memory just before the kth
reference is denoted by Z_i(k - 1), and its size (in pages) by z_i(k - 1). A page fault
occurs at virtual time k if r_i(k) is not in Z_i(k - 1).
There are basically two memory-fetching policies used in fetching the pages
of a process when a page fault occurs, demand prefetching and demand fetching. In
demand prefetching, a number of pages (including the faulting page) of the process
are fetched in anticipation of the process's future page requirements. In general,
prefetching can, if properly designed, improve performance by permitting an
overlap between the execution and the fetching of the same program. Prefetching
techniques will be discussed later. In demand fetching, only the page referenced is
fetched on a miss. Demand prefetching can result in an increase in superfluity. Under
the assumption of demand fetching, Z_i(k) is the same as Z_i(k - 1) plus r_i(k), less
any pages {y_i} of Z_i(k - 1) replaced by the memory policy. Hence, using set
notation,

Z_i(k) = Z_i(k - 1) + {r_i(k)} - {y_i}
z_i(k) ≤ z_i(k - 1) + 1    (2.11)
The optimal memory policy for the IRM replaces the page with the smallest value of
a_i among the pages present in the resident set. The IRM is the simplest way of
accounting for the nonlinearities observed in the swapping curves of real programs.
Note that an assumption of completely random references would imply linear
swapping curves. The IRM is not a good model of overall program behavior.
The LRU stack model (LRUSM) is based on the LRU memory policy.
This model uses an "LRU stack," which is a vector that orders the pages by
decreasing recency of reference. Just after referencing r(t), the first position will
contain r(t). A stack distance g(t) is associated with the reference r(t); g(t) is
the position of r(t) in the stack just after r(t - 1). The LRU stack has the property
that (a) the LRU policy's resident set of capacity s pages always contains the
first s elements of the stack, and (b) the missing-page rate is the frequency of
occurrences of the event g(t) > s. The LRUSM assumes that the distances are
independent random variables with a common stationary distribution. Thus the
probability of referencing a page at stack distance j is b_j. If
b_1 ≥ b_2 ≥ ... ≥ b_j ≥ ... ≥ b_n, then the LRU policy is optimal both in fixed-
space and variable-space strategies. The LRUSM is slightly better than the IRM.
The above models do not adequately capture the essence of program behavior,
namely the changing need for memory from one phase of execution to another. A
realistic model must account for multiple program phases over locality sets of
significantly different sizes and must not rule out strong correlations between
distant phases. Some phase-transition models of program behavior have been
developed which are more realistic than the last two models. Briefly, the program
model consists of a macromodel and a micromodel. The macromodel is a semi-
Markov chain whose "states" are mutually disjoint locality sets and whose "holding
times" are phases. The macromodel is used to generate a sequence of locality-set-
holding-time pairs (S, T). The micromodel is used to generate a reference substring
of length T over the pages of locality set S. For the micromodel, the IRM or
LRUSM may be used.
The page-fault rate function for process P_i, denoted by f_i(A, s), is the expected
number of page faults generated per unit of virtual time when a given reference
string R_i is processed by memory policy A, subject to the memory space con-
straint s. The page-fault rate function is one of the most important parameters in
the study of memory management. Most studies performed indicate that this
function is relatively independent of R_i. For the fixed-memory allocation, the
space constraint is interpreted to mean that the resident set sizes must satisfy
z_i(k) ≤ s for all virtual times k. For the case of variable-space allocation, the space
constraint s is interpreted as the average resident set size of process P_i, that is,

s = (1/T) Σ_{k=1}^{T} z_i(k)    (2.12)
for a system with d active processes. In both allocation schemes, the total page-fault
rate is

f = Σ_{i=1}^{d} f_i(A, s_i)    (2.13)

where the resident-set sizes are subject to the memory constraint
Σ_{r=1}^{d} (-1)^{r+1} N_r ≤ M, N_r being the number of pages shared by r processes and
M the total number of page frames in the main memory.
Another measure of the page-fault rate is the lifetime function e_i(z_i), which
gives the mean execution interval (in virtual time) between successive page faults
for process P_i when it has z_i of its pages in shared memory. The derivation of this
function assumes a given memory policy. A knee of a lifetime curve is a point at
which e_i(z_i)/z_i is locally maximum. The primary knee is the global maximum of
e_i(z_i)/z_i. A typical lifetime curve is shown in Figure 2.16.
Several empirical models of the lifetime curve have been proposed. One is the
Belady model:

e_i(z_i) = a z_i^k    (2.14)
(Figure 2.16 A typical lifetime curve e_i(z_i): mean time between page faults versus mean resident-set size, with the primary knee marked.)
where z_i is the mean resident set size, a is a constant, and k is normally between 1.5
and 3. In general, a and k depend on the program characteristics. This model is often
a reasonable approximation of the portion of the lifetime curve below the primary
knee, but it is otherwise poor.
A second model is the Chamberlin model:

e_i(z_i) = 2b / (1 + (c/z_i)^2)    (2.15)

where b and c are constants. It can be shown that the
transition from the concave to the convex region occurs at z_i = c/√3, and that
at z_i = c the curvature in the convex region is maximum. Therefore, c could be
considered a reasonable approximation to the memory demand of P_i.
Although this model has a knee, it is not a very good match for real programs.
It is generally quite easy to measure lifetime curves from real data and such
measurements are generally more reliable than estimates from models. If the page
transfer time is S, then the page-fault rate for process P_i is

f_i = 1 / (e_i(z_i) + S)    (2.16)
This equation can be used to derive an optimization problem, which can then be
solved to obtain optimum memory space allocation in a multiprogramming system.
Another measure that is often used is the space-time product of an active pro-
cess. This product is the integral of a program's resident set size over the time T it
is running or waiting for a missing page to be swapped into shared memory. Let
z(t) be the size of the resident set at time t, t_i be the time of the ith page
fault (i = 1, ..., K), and D be the mean swapping delay. The space-time product is

ST = Σ_{t=1}^{T} z(t) + D Σ_{i=1}^{K} z(t_i)    (2.17)

If s is the mean resident set size, we can approximate ST by noting that the first
sum becomes sT. If we approximate the second sum by sK and note that sK =
s(K/T)T = s f(s) T, where f(s) is the missing-page rate, the space-time product is
approximated by

ST ≈ Ts[1 + D f(s)]    (2.18)
Although Eq. 2.18 is simple to compute, the approximation is not very reliable.
Note that s f(s) = s/e(s) is minimum at the primary knee of the lifetime curve.
If D is large, choosing s at this knee will approximately minimize the space-time
product.
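A small sketch (Python) shows how Eq. 2.18 can be minimized numerically over the mean resident-set size; the lifetime curve used is the Belady form of Eq. 2.14 with invented constants.

    # Space-time product ST = T*s*(1 + D*f(s)), with f(s) approximated by 1/e(s)
    # and a Belady-style lifetime curve e(s) = a * s**k (constants invented).
    a, k = 0.5, 2.0            # lifetime-curve parameters
    T, D = 1_000_000, 10_000   # virtual run time and mean swapping delay

    def space_time(s):
        e = a * s ** k                    # mean time between faults
        return T * s * (1 + D / e)        # Eq. 2.18 with f(s) = 1/e(s)

    best = min(range(10, 400), key=space_time)
    print(best, space_time(best))         # minimum near s = sqrt(D/a), about 141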
Figure 2.17 A multiprogrammed multiprocessing virtual memory system model.
The model consists of an active network, which contains the processors, the paging
devices, and the file memory, and a passive network, which contains a process queue
and the policies for admitting new processes to active status.
A process is considered active if it is in the active network, where it is eligible to
receive processing and have pages in main memory. Each active process is waiting
or in service at one of the three classes of resources in the active network. It waits
at the file I/O class whenever it requires a segment to be transferred between main
memory and the disk memory. An active process waits at the paging device modules
whenever it requires a page to be transferred between main memory and a paging
device, such as a drum or fixed-head disk. Otherwise it is in the CPU station.
The box labeled “ Process queue” contains a set of enabled (passive) processes,
a decision policy for activating them, and a load-control mechanism for controlling
d(t). Notice that each CPU node is usually considered to have a cache whose
action is transparent, i.e.,a cache miss does not necessitate a context switch. When
a process either issues a file I/O request or creates a page fault, it will release its
processor to another ready process and wait for the completion of the I/O transac-
tion. Such a model, as depicted in the Figure 2.17, is called a closed queueing net-
work model with d processes, where d is the steady-state degree of multipro-
gramming.
In addition to the DOM, another parameter used in the memory management
model is the average total time used to service each process which requested paging
device i. This time, which is denoted D_i, is the demand per process for the ith
device. For each device, D_i is the product of the mean number of requests per
process for that device and the mean time to service one request. For the paging
device, the demand per process grows with d because higher DOMs imply smaller
resident sets and higher rates of paging. For devices such as CPU and I/O, the
demand per process does not depend on d. The demand for each CPU is the mean
execution time E of a process. The average number of page faults per process is
E/L(d), where L(d) denotes the lifetime or mean time between faults for a DOM
of d.
The demand for the paging device is D_i = ES/L(d), where S is the mean time
to service one page transfer (exclusive of queueing delays). If the function L(d) is
not available from a direct measurement of the system, it can be estimated from
the lifetime curve of a typical program. One method to estimate L(d) is to set
L(d) = e(M/d), where e(x) is the mean time between page faults for a typical
program when the given memory policy produces a mean resident set size of x
pages and M is the number of available pages of main memory.
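As a numerical illustration of these estimates (Python; the lifetime-curve constants, memory size, and service times are all invented), the paging-device demand ES/L(d) can be tabulated against the degree of multiprogramming, noting where L(d) falls to S:

    # Paging-device demand versus degree of multiprogramming d, with L(d) = e(M/d)
    # and a Belady-style lifetime curve e(x) = a * x**k.
    a, k = 0.2, 1.8            # invented lifetime-curve constants
    M = 1024                   # page frames of main memory
    E = 5_000_000              # mean CPU demand per process (time units)
    S = 4_000                  # mean time to service one page transfer

    def L(d):
        return a * (M / d) ** k           # mean time between faults at DOM d

    for d in range(1, 13):
        demand = E * S / L(d)             # paging-device demand per process
        mark = "  <- L(d) <= S" if L(d) <= S else ""
        print(f"d={d:2d}  L(d)={L(d):10.1f}  demand={demand:14.1f}{mark}")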
The queueing network model of Figure 2.17 can be used to estimate the system's
throughput X_0, which is the number of processes completed per second. The
throughput is proportional to the average utilization U_0 of the CPUs and is given
by X_0 = U_0/E. Figure 2.18 illustrates typical CPU utilization curves as a function of
the DOM. The curve rises toward CPU saturation as the degree of multipro-
gramming d increases, but is eventually depressed by the ratio L(d)/S once the
paging device saturates. As suggested in Figure 2.18a, the DOM d_1, at which
L(d) = S, is close to the optimum d_0. Note that beyond d_0, the system
begins to thrash.
(Figure 2.18 CPU utilization as a function of the degree of multiprogramming: the curve rises toward CPU saturation and is then pulled down as the paging device saturates.)
(Figure 2.19 CPU utilization versus degree of multiprogramming, with and without the limit d_max enforced by a memory queue.)
If the number of processes present at a given time does not exceed d_max, all are
active; otherwise, the excess processes are held inactive in a memory queue. The
limiting effect of the memory queue is illustrated in Figure 2.19. In practice, the
optimum DOM varies with the work load; therefore an adaptive control is required
to adjust d_max.
The load control is accomplished by a component of a dispatcher, which is
part of the operating system. The purpose of the dispatcher is to control the
scheduling of processes and allocation of main memory so that the throughput
for each work load is maximum. The dispatcher consists of three components: the
scheduler, the memory policy, and the load controller. The scheduler determines
the composition of the active set of processes. It does this by activating processes
from the passive process queue into the active set. The memory policy determines
a resident set for each active process and, as we have seen, the load controller
adjusts the limit d_max on the degree of multiprogramming. All memory policies
manage a pool of unused space in main memory. The pool contains the pages of
resident sets of recently deactivated processes. Under a variable-space policy, the
pool also contains pages which have recently left the resident sets of active pro-
cesses. By comparing the measured memory demand of a process with the pool's
size, the scheduler avoids activating a process if the activation would overload the
system.
2. Belady's optimal algorithm (MIN)--at a page fault, replace the page in Z_i(t)
with the largest forward distance:

O(Z_i(t)) = y   if and only if   d_t(y) = max_{x ∈ Z_i(t)} d_t(x)    (2.22)
Since the LRU policy is one of the most popular algorithms, we will describe
its implementation. Associated with this policy is a dynamic list known as the
LRU stack, which arranges the referenced pages from top to bottom by decreasing
order of recency of reference. At a page replacement time, the LRU policy chooses
the lowest-ranked page in the stack; therefore, the contents of an s-page resident
set must always be the pages occupying the first s stack positions. When a page is
referenced, the stack is updated by moving the referenced page to the top and
pushing down the intervening pages by one place. The position at which the refer-
enced page was found before being promoted to the top is called the stack distance.
A page fault occurs in an s-page resident set at a given reference if and only if the
stack distance of that reference exceeds s. In the fixed partitioning strategy, each
active process has its own LRU stack.
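The LRU stack and its distances are easy to simulate; the sketch below (Python) counts, for several resident-set sizes s, the page faults implied by the stack-distance property just stated.

    # LRU stack simulation: a reference faults in an s-page resident set exactly
    # when its stack distance exceeds s (infinite on the first reference).
    def lru_stack_distances(refs):
        stack, dists = [], []
        for page in refs:
            dists.append(stack.index(page) + 1 if page in stack else float("inf"))
            if page in stack:
                stack.remove(page)
            stack.insert(0, page)          # referenced page moves to the top
        return dists

    refs = [1, 2, 3, 1, 4, 1, 2, 5, 1, 2]
    dists = lru_stack_distances(refs)
    for s in (1, 2, 3, 4):
        faults = sum(1 for d in dists if d > s)
        print(f"s={s}: {faults} faults in {len(refs)} references")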
Algorithms such as LRU, LFU, LIFO, FIFO, and RAND, which are called
nonlookahead algorithms, are realizable. MIN is a lookahead page-replacement
algorithm and is not realizable, but provides a benchmark on which we can measure
the relative performance of the realizable algorithms. Figure 2.20 illustrates the
typical relative page-fault rates for various paging algorithms. The page-fault
rate f(A, s) for a given algorithm A and resident size constraint s can be computed
from the reference string R.
Let S_t(A, s, R) represent the set of pages in the resident set under the size constraint s
at time instant t when processing a reference string R. A natural expectation is
that if the size constraint s increases, the following inclusion property would hold:

S_t(A, s, R) ⊆ S_t(A, s + 1, R)    (2.23)
Figure 2.20 Page fault rates of realizable and nonrealizable algorithms for various resident set sizes.
Not every algorithm, however, satisfies the inclusion property. For example, consider the processing of the reference
string R = 12314, using the FIFO algorithm, when the address space of the process
is the set M = {1, 2, 3, 4} for two resident size constraints, s = 2 and s = 3. Below
we show the sequence of S_t states generated as a result of the processing of the string
R. In this illustration, an asterisk (*) after a reference indicates that no page
fault occurred; otherwise, a page fault did occur.
            S_1     S_2     S_3       S_4        S_5
R =          1       2       3         1          4

s = 2:      {1}    {1,2}   {2,3}    {3,1}      {1,4}      (every reference faults)
s = 3:      {1}    {1,2}   {1,2,3}  {1,2,3}*   {2,3,4}    (* the fourth reference, to page 1, does not fault)
Notice that S_5(FIFO, 2, R) ⊄ S_5(FIFO, 3, R) and, hence, FIFO does not satisfy the inclu-
sion property. The normalized page-fault rate can be obtained from the expression

f(A, s) = N(A, s, R) / |R|    (2.24)
where N(A, s, R) is the number of page faults which occurred in the processing of
the reference string R using algorithm A and a resident set-size constraint of s.
|R| is the cardinality of R, or the number of references in R. For the example above,
f(FIFO, 2) = 1.0 and f(FIFO, 3) = 0.8. Algorithms which satisfy the inclusion
property are called stack algorithms.
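The example can be reproduced mechanically; the sketch below (Python) simulates FIFO on the same string and confirms the two fault rates.

    # FIFO simulation reproducing the R = 1 2 3 1 4 example for s = 2 and s = 3.
    from collections import deque

    def fifo_fault_rate(refs, s):
        frames, faults = deque(), 0
        for page in refs:
            if page not in frames:
                faults += 1
                if len(frames) == s:
                    frames.popleft()        # evict the page loaded earliest
                frames.append(page)
        return faults / len(refs)

    R = [1, 2, 3, 1, 4]
    print(fifo_fault_rate(R, 2), fifo_fault_rate(R, 3))   # 1.0 and 0.8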
Although this method of derivation of the page-fault rate for a given reference
string is adequate, it does not account for the mechanisms by which programs generate
reference strings. Moreover, the procedures do not readily extend to the analysis
of variable-space policies which use the locality of reference model. We will now
consider the paging algorithms for variable-space partitioning strategy using a
global policy.
Several important algorithms for implementing variable-space partitioning
strategies have been used. One approach to the memory management commonly
used extends the idea of a fixed-space replacement policy simply by applying the
replacement rule to the entire contents of main memory, without identifying
which process is using a given page. Examples of this approach are:
1. Global LRU -- which arranges all the pages of the active processes into a single
global LRU stack. Whenever an active process runs, it will reference its locality
set pages and move them to the top of the global LRU stack.
2. Global FIFO--which arranges all the pages of the active programs into a single
global FIFO list.
3. Global CLOCK -- which links the page frames of the active processes into a
circular list and associates a usage bit with each frame; the usage bit is set by
hardware to 1 when the page is referenced. Whenever a page fault occurs, the
memory policy advances the current-position pointer around the list clearing set
usage bits and stopping at the first page whose usage bit is already cleared to 0:
this page is selected for replacement. This paging algorithm was used in the
Multics system. The above variable-space allocation policies do not attempt to
identify locality sets and protect them from preemption.
Another example of the variable memory partitioning is the working-set
(WS) algorithm, which takes into account the varying memory requirements during
the execution of a process. Denning (1968) introduced the concept of the working set to
describe program behavior in virtual memory environments. The working set
W(t, θ) is used to denote an estimator of a locality set. W(t, θ) of a process at
time t is defined as the set of distinct pages which are referenced during the execu-
tion of the process over the interval (t - θ, t), where θ is the window size. The
working-set size w(t, θ) is the number of pages (cardinality) of the set W(t, θ).
This algorithm retains in memory exactly those pages of each process that
have been referenced in the preceding θ seconds of process (virtual) time. If an
insufficient number of page frames is available, then a process is deactivated
in order to provide additional page frames. Notice that the working-set policy
is very similar to the LRU policy in that the working-set algorithm specifies the
removal of the least recently used page when that page has not been used for the
preceding θ time units, whereas the LRU algorithm specifies the removal of the least
recently used page when a page fault occurs in a memory of capacity s.
The success of the working-set algorithm is based on the observed fact that a
process executes in a succession of localities; that is, for some period of time the
process uses only a subset of its pages and with this set of pages in memory, the
program will execute efficiently. This is because, at various times, the set of
pages used in the preceding θ seconds (for some appropriate θ) is considered to be
a better predictor than simply the set of K (for some K) pages most recently used.
Thus, for example, a compiler may need only 25 pages to execute efficiently during
parsing, but may need 50 during code generation. A working set with the correct
choice of the parameter θ would adapt well to this situation, whereas a constant
K over both phases of the compiler would either use excess space in the syntax
phase or insufficient space in the code-generation phase. The working-set paging
algorithm, although efficient, is difficult to implement, however.
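A direct sketch of the working-set computation is given below (Python); the window θ is counted here in references of process virtual time.

    # Working set W(t, theta): the distinct pages referenced in the last theta
    # references, recomputed at each virtual time step.
    def working_sets(refs, theta):
        return [set(refs[max(0, t - theta):t]) for t in range(1, len(refs) + 1)]

    refs = [1, 2, 1, 1, 3, 3, 3, 4, 4, 4, 4, 2]
    for t, ws in enumerate(working_sets(refs, theta=4), start=1):
        print(f"t={t:2d}  W(t,4)={sorted(ws)}  w(t,4)={len(ws)}")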
Yet another variable-partitioning strategy which can use a local or global
policy is the page-fault frequency (PFF) replacement algorithm. The PFF also
attempts to follow variations in localities when allocating memory space to
processes. This policy is implemented using hardware usage bits and an interval
timer and is invoked only at the time of a page fault. Let t′ and t, for t > t′, denote
two successive (virtual) times at which page faults occur in a given process. Also,
let R(t, θ) denote the PFF resident set just after time t, given that the control
parameter of PFF has the value θ. Then

R(t, θ) = R(t′, θ) ∪ {r(t)}     if t - t′ < θ
R(t, θ) = W(t′, t) ∪ {r(t)}     if t - t′ ≥ θ
where r(t) is the page referenced at time t (and found missing from the resident
set). The interfault interval t − t′ is used as a working-set window, and the parameter
θ acts as a threshold to guard against underestimating the working set in
the case of a short interfault interval. Hence, if the interval is too short, the resident
set is augmented by adding the faulting page r(t). The usage bits, which are reset at
each page fault, are used to determine the resident set if the timer reveals that the
interfault interval exceeds the threshold. Note that 1/θ can be interpreted as the
maximum tolerable frequency of page faults.
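The PFF decision rule can be sketched in a few lines; the resident set and usage bits are modeled abstractly here, and the reference string and threshold θ are hypothetical.

def pff(trace, theta):
    """Page-fault frequency policy: on a fault, compare the interfault
    interval with theta to decide whether to augment the resident set
    or shrink it to the pages used since the last fault."""
    resident, used, faults, last_fault = set(), set(), 0, 0
    for t, page in enumerate(trace, start=1):
        if page in resident:
            used.add(page)                      # usage bit set on a hit
            continue
        faults += 1
        if t - last_fault >= theta:
            resident = used | {page}            # long interval: keep only recently used pages
        else:
            resident = resident | {page}        # short interval: just add the faulting page
        used = {page}                           # usage bits are reset at each fault
        last_fault = t
    return faults

print(pff(list("abacabcabdedfdefdd"), theta=3))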
Various experimental studies have shown that WS and PFF, when properly
“tuned” by good choices of their control parameters, perform nearly the same as
each other and considerably better than LRU. The PFF may display anomalies
for certain programs since it does not satisfy the inclusion property. Since global
memory policies make no distinctions among programs, their load controls
have no dynamically adjustable parameters. However, these controls cannot
ensure that each active program is allocated a space-time minimizing resident set.
Local memory policies such as WS and PFF offer a much finer level of control
and are capable of much better performance than global policies. These policies,
however, present the problem of selecting a proper value of the control parameter
θ for each program.
Finally, we present an ideal variable-space memory policy which could be
local or global. This is the optimal variable-space replacement algorithm called VMIN.
VMIN generates the least possible fault rate for each value of mean resident-set
size. At each reference r(t) = i, VMIN looks ahead: if the next reference to
page i occurs in the interval (t, t + θ], VMIN keeps i in the resident set until that
reference; otherwise, VMIN removes i immediately after the current reference.
Page i can be reclaimed later, when needed, by a fault. In this case, θ serves as a
window for lookahead, analogous to its use by WS as a window for lookbehind.
VMIN anticipates a transition into a new phase by removing each old page from
residence after its last reference prior to the transition. This results in the behavior
depicted in Figure 2.21. In contrast, WS retains each segment for as long as θ time
units after the transition. VMIN and WS generate exactly the same sequence of
page faults for a given reference string. The suboptimality of WS results from resident-set
"overshoot" at interphase transitions, as shown in Figure 2.21. However,
since VMIN is a lookahead algorithm, it is not practical.
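A minimal sketch of the VMIN rule follows; it assumes the whole reference string is known in advance, which is exactly why the policy is not realizable, and the trace and window are hypothetical.

def vmin_faults(trace, theta):
    """VMIN: after referencing page i at time t, keep i resident only if
    its next reference falls within (t, t + theta]; otherwise drop it."""
    next_ref = [None] * len(trace)              # time of the next reference to the same page
    last_seen = {}
    for t in range(len(trace) - 1, -1, -1):
        next_ref[t] = last_seen.get(trace[t])
        last_seen[trace[t]] = t
    resident, faults = set(), 0
    for t, page in enumerate(trace):
        if page not in resident:
            faults += 1                         # the page is reclaimed by a fault
        if next_ref[t] is not None and next_ref[t] - t <= theta:
            resident.add(page)                  # re-referenced soon: keep it
        else:
            resident.discard(page)              # drop it immediately after this reference
    return faults

print(vmin_faults(list("abacabcabdedfdefdd"), theta=3))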
[Figure 2.21 Resident-set size (pages) versus virtual time for WS and VMIN: at an interphase transition around times t and t + θ, WS exhibits an irremovable overshoot which VMIN clips by dropping each old page after its last reference.]
prefetched page is referenced, then the prefetched page is replaced. In all other
respects, this stack is maintained like an LRU stack.
This algorithm, when applied to database systems and vector operations,
performs adequately where a high degree of sequentiality is present. It has been
observed that common programs in execution do not possess adequate spatial
locality unless the page size is rather small. As we shall see in Section 2.4, systems
with caches employ a small block size so the OBL algorithm may be used to ad-
vantage in them. Also, since the units of information transfer from memory to the
processor are quite small, and the instruction stream tends to exhibit a high degree
of sequentiality, sequential prefetching of instructions into an instruction buffer
is commonplace. Sequentiality may be induced in the data streams for vector in-
structions. Prefetching algorithms must be designed carefully so as not to nullify
the potential gain in the reduction of page faults or misses by a disproportionate
increase in the number of fetches.
Cache memories are high-speed buffers which are inserted between the processors
and main memory to capture those portions of the contents of main memory which
are currently in use. They can also be inserted between main memory and mass
storage. Since cache memories are typically five to 10 times faster than main
memory, they can reduce the effective memory access time if carefully designed
and implemented. This section discusses the characteristics of cache memories
and the various cache management strategies. Four cache organizations are discussed:
direct, fully associative, set associative, and sector mapping. Cache
replacement policies are used to decide what cache block to replace when a new
block is to be brought into the cache.
Operation of a cache The cache memory generally consists of two parts, the cache
directory (CD) and the random-access memory (RAM). The memory portion is
partitioned into a number of equal-sized blocks called block frames. The directory,
holds the address tags that identify which main memory blocks currently occupy the block frames.
Design aspects In general, the design of a cache is subject to different constraints
and trade-offs than that of main memory. One of the important parameters in the
design of a cache memory is the placement policy, which establishes the correspondence
between main memory blocks and those in the cache.

[Figure: overall flowchart of a cache access under a paged virtual memory — the virtual address is used to search the TLB; on a TLB miss the virtual address is sent to the dynamic address translator (DAT) and the TLB replacement status is updated; the cache directory (CD) is searched; on a cache miss the virtual address is translated to a real address, the block is fetched and stored in the cache, and the CD replacement status is updated; finally the byte/word is sent to the processor.]

Other organizational
parameters are the fetch policy, the replacement policy, the main memory update
policy, homogeneity, the addressing scheme, cache and block sizes, and the cache
bandwidth. The main memory update policy decides the time the information in
memory is to be updated once the processor has requested a modification of the
information. The fetch policy denotes how, when, and what information is to be
fetched into the cache. The cache could be partitioned into several independent
caches to segregate various types of references. An unpartitioned cache is said
to be homogeneous. The cache could be multiported so that two or more requests
can be made to the cache concurrently. In this case, a priority algorithm must
exist to select one of the arbitrating requests. Furthermore, the cache accesses can
be pipelined as in many mainframes, so that more than one cache access can be in
progress concurrently. For example, four cache requests can each be in a unique
phase of completion if the cache cycle is partitioned into four segments as follows:
priority selection, TLB access, cache access, and replacement status update.
Cache bandwidth The cache bandwidth is the rate at which information can be
transferred from or to the cache. The bandwidth must be sufficient to support the
rate of instruction execution and I/O. The bandwidth can be improved by increasing
the data-path width, interleaving the cache for concurrency, and decreasing the
access time. The cache bus width affects the cost, reliability, and throughput of the
system. An increase in bus width increases the access time because of packaging
problems and additional gate delays due to line drivers and receivers. It also
diminishes the signal-switching noise immunity. However, the wider the bus, the
faster the data transfer. The number of fetches to main memory required to load a
block of a given size depends on the bus width. Interleaving the cache can keep
the bus width low while maintaining the bandwidth.
the frequency of task switches is reduced and a task, once assigned to the processor,
will get a chance to reach a steady-state ("warm-start") miss ratio before the next
context switch. Another solution is to save the process's context in main memory on
a context switch and reload it en masse the next time it is assigned to the processor.
Data consistency The problem of having several different copies of the same block
in a system is referred to as the cache coherence or data consistency problem. This
problem exists in a uniprocessor with cache when the processor can be active
after modifying a word in the cache and before the copy in memory has been
updated. The effect of the main memory update policy on data consistency will
be discussed in Section 2.4.3. If the processor is the only unit to access memory,
then the coherence problem is a mere theoretical observation without practical
bearing on the correctness of the program execution. However, practical systems
contain I/O units which require access to the memory. The method by which the
I/O unit accesses the memory in a system with cache may create consistency
problems, as will be seen in Section 2.5. In a multiple-processor system with
caches, the data consistency problems may also exist between caches. Solutions
to such coherence problems will be discussed in detail in Chapter 7.
2.4.2 Cache Memory Organizations
The cache is usually designed to be user-transparent. Therefore, in order to locate
an item in the cache, it is necessary to have some function which maps the main
memory address into a cache location. For uniformity of reference, both cache
and main memory (MM) are divided into equal-sized units, called blocks in the
memory and block frames in the cache. The placement policy determines the
mapping function from the main memory address to the cache location.
Placement policies There are basically four placement policies: direct, fully
associative, set associative, and sector mapping. In discussing the mapping
functions, we will consider a specific running example in which each processor's
cache is of size 2K (2048) words with 16 words per block. Thus the cache has
128 block frames. Let the main memory have a capacity of 256K words, or 16,384
blocks. The physical address is representable in 18 bits.
Direct mapping This is the simplest of all organizations. In this scheme, block i
of the memory maps into block frame i modulo 128 of the cache. The memory
address consists of three fields: the tag, block, and word fields, as depicted in
Figure 2.23. Each block frame has its own specific tag associated with it. When a
block of memory resides in a block frame of the cache, the tag associated with that
frame contains the high-order 7 bits of the main memory address of that block. When a
physical memory address is generated for a memory reference, the 7-bit block
address field is used to address the corresponding block frame. The 7-bit tag
address field is compared with the tag in the cache block frame. If there is a match,
the information in the block frame is accessed by using the 4-bit word address field.
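For the running example, the decomposition of an 18-bit word address can be sketched as follows; the helper and the sample address are illustrative only.

def split_direct(addr):
    """Direct mapping for the running example: 4-bit word-within-block
    field, 7-bit block-frame field (block number mod 128), 7-bit tag."""
    word  = addr & 0xF
    frame = (addr >> 4) & 0x7F
    tag   = addr >> 11
    return tag, frame, word

# Main memory block 256 starts at word address 256 * 16 = 4096;
# it maps to block frame 256 mod 128 = 0 with tag 256 // 128 = 2.
print(split_direct(4096))   # (2, 0, 0)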
Figure 2.23 Direct mapping cache organization. [The 18-bit memory address is divided into a 7-bit tag field, a 7-bit block field (2^7 = 128 block frames), and a 4-bit word field (2^4 = 16 words per block).]
Figure 2.24 Fully associative cache organization. [Any of the 16,384 main memory blocks may be placed in any of the 128 block frames; the memory address is divided into a 14-bit tag field and a 4-bit word field.]
Fully associative In terms of performance, this is the best and most expensive
cache organization. The mapping is such that any block in memory can be placed in any
block frame. When a request for a block is presented to the cache, all the map entries
are compared simultaneously (associatively) with the request to determine if the
requested block is present in the cache. In the running example, 14 tag bits are required to
identify the memory block when it is present in the cache. Figure 2.24 illustrates the
fully associative buffer. The mapping flexibility permits the development of a
wide variety of replacement algorithms, some of which may be impractical.
Although the fully associative cache eliminates high block contention, it
incurs a longer access time because of the associative search.
[Figure: set associative cache organization for the running example — the 128 block frames are grouped into 64 sets of two frames each; the memory address is divided into an 8-bit tag field, a 6-bit set field, and a 4-bit word field.]
Several possible schemes are used for mapping a physical address into a set
number. The simplest and most common is the bit-selection algorithm. In this
case, the number of sets S is a power of 2 (say 2^k). If there are 2^j words per block,
the first j bits, 0, ..., j − 1, select the word within the block, and bits j, ..., j + k − 1
select the set via a decoder. Hence, for the running example, the 6-bit set field of the
memory address defines the set of the cache which might contain the desired block,
as in the direct-mapping scheme. The 8-bit tag field of the memory address is
then associatively compared with the tags in that set. If a match occurs, the block is
present. The cost of the associative search in a fully associative cache depends on
the number of tags (blocks) to be simultaneously searched and the tag field length.
The set-associative cache attempts to cut this cost down and yet provide a performance
close to that of the fully associative cache. For this reason, it is the most
commonly used placement policy for cache memories.
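For the running example (64 sets of two block frames), the bit-selection mapping can be sketched as follows; the function and its parameters are illustrative. Note that setting the number of sets to 1 degenerates to the fully associative organization and setting it to 128 to direct mapping.

def split_set_assoc(addr, words_per_block=16, num_sets=64):
    """Bit-selection mapping: the low j bits select the word, the next
    k bits select the set, and the remaining high-order bits are the tag."""
    j = words_per_block.bit_length() - 1    # j = 4 in the running example
    k = num_sets.bit_length() - 1           # k = 6 in the running example
    word = addr & (words_per_block - 1)
    set_index = (addr >> j) & (num_sets - 1)
    tag = addr >> (j + k)
    return tag, set_index, word

# Block 256 (word address 4096) falls in set 256 mod 64 = 0 with 8-bit tag 4.
print(split_set_assoc(4096))             # (4, 0, 0)
print(split_set_assoc(4096, 16, 1))      # num_sets = 1: fully associative (14-bit tag)
print(split_set_assoc(4096, 16, 128))    # num_sets = 128: direct mapping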
The main considerations in choosing the values of S and E are the directory
lookup time, cost, miss ratio, and addressing. Note that S and E are inversely
related, assuming a constant number of block frames M = SE = 2^7. The set size E defines the degree of
associative search and thus the cost of the search. The addressing scheme used
can indicate whether an overlap in cache lookup and the translation operation
(via TLB) is possible in order to reduce the cache access time. Recall that the only
address bits of a virtual address that get mapped in a virtual memory system are
the ones that specify the page address. In order to illustrate how an overlap may
occur, assume that there are 2^j bytes per block and 2^k sets in the cache. Let the
page size be 2^p bytes. Assuming bit-selection mapping, p − j bits are immediately
available to choose the set, since the low-order p bits, which specify the byte within
the page, are invariant with respect to the mapping. It is quite advantageous to
make p − j = k so that the set can be selected immediately, in parallel with the
translation process. This overlap is shown in Figure 2.26. However, if p − j < k,
then the search for the cache block can only be narrowed down to a small number,
2^(k − (p − j)), of sets.
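The amount of overlap can be checked with a short calculation; the page, block, and set sizes below are hypothetical and serve only to illustrate the 2^(k − (p − j)) relationship.

def candidate_sets(page_bytes, block_bytes, num_sets):
    """With 2^p-byte pages, 2^j-byte blocks, and 2^k sets, p - j set-index
    bits are available before translation; if p - j < k, the lookup can be
    narrowed only to 2^(k - (p - j)) candidate sets."""
    p = page_bytes.bit_length() - 1
    j = block_bytes.bit_length() - 1
    k = num_sets.bit_length() - 1
    return 1 if p - j >= k else 2 ** (k - (p - j))

print(candidate_sets(4096, 32, 64))     # 1: set selection fully overlaps translation
print(candidate_sets(1024, 32, 128))    # 4 candidate sets must be searched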
The effect S and E have on the miss ratio can only be measured by trace-driven
simulation on a typical work load. However, Smith (1978) derived a relationship
between the miss ratios of fully associative and set associative cache organizations.
Assuming an LRU stack-programming model with coefficients drawn from
the linear paging model, it was shown that the ratio of the miss ratios between the
set associative and the fully associative cache is
    R(E, S) = (E − 1/S) / (E − 1)    for E ≥ 2    (2.26)
Figure 2.27 shows an example of the effect of S and E on the miss ratio. Experimental
results for uniprocessors have shown that a set size in the range of 2 to 16 performs
almost as well as fully associative mapping, at little cost increase over direct
mapping. Notice that when E = M it is the fully associative mapping, and when
E = 1 it is the direct-mapping scheme. Table 2.2 shows some examples of systems
that use set-associative caches and their choices of S and E.
Figure 2.26 Set associative cache operations. [Flowchart: the set field of the virtual address selects a cache set while the TLB is searched in parallel; on a TLB miss the virtual address is sent to the dynamic address translator (DAT) and the TLB replacement status updated; the translated real address is compared with the tags of the selected set; on a cache miss the block is fetched from memory using the real address, stored in the cache, and the replacement status of the set updated; the desired bytes are then selected from the block and sent to the processor.]
Figure 2.27 Miss ratios as a function of the number of sets and cache capacity (Smith, ACM Surveys, 1982). [Plot for the WFV-APL-WTX-FFT traces with Q = 10,000 and 32-byte blocks: miss ratio (roughly 0.005 to 0.100) versus cache capacity (0 to 60,000 bytes), with curves for 32, 64, and 128 sets.]
Block and cache size selection In caches, a block is so small that spatial locality
effects are the main consideration in the choice of the block size. The effect of
cache size and block size on the hit ratio relates to spatial and temporal localities.
For a given cache size, the miss ratio improves as the block size increases, because
an increase in the block size captures more of the spatial locality. This improvement
is achieved to the detriment of the temporal locality, because the total number of
blocks in the cache is diminished proportionally.

[Figure: sector mapping cache organization — main memory's 16,384 blocks are grouped into 1,024 sectors of 16 blocks each, and the cache's 128 block frames are grouped into sector frames.]
curve levels off for a block size such that the effects of a block-size increase on
both localities compensate each other. Beyond this block size, most of the spatial
locality has been captured and the miss ratio curve inflects, as the cache is not
capable of holding the temporal locality of the program.
When the number of blocks in the cache is so small that blocks are swapped
constantly between the cache and the memory, the efficiency of the cache goes to
zero. The block size corresponding to the minimum miss ratio depends not only
on the cache and its organization but also on the program behavior or work load.
Because properties of programs vary widely, the choice of block and cache sizes
must result from extensive simulations based on traces of programs constituting
a representative work load. Figure 2.29 shows two example miss-ratio curves
for a given work load in which the time slice Q is varied, expressed as a
number of memory references. It can be seen that Q also affects the selection of the
block and cache sizes.
Figure 2.29 Miss ratios as a function of block size and cache capacity (Smith, ACM Surveys, 1982). [Two plots for the WFV-APL-WTX-FFT traces with 64 sets and block sizes from 8 to 256 bytes: (a) quantum size Q = 10,000 references; (b) quantum size Q = 250K references.]
Virtual versus physical addressing We have seen that the cache is addressed
using the physical address. Although in set-associative caches the translation
of the virtual address can be overlapped with the lookup in the cache, the lookup
cannot be completed until the physical address is available. The cache access
time could be significantly reduced if the translation step could be eliminated. In
this case, the cache would have to be addressed directly using the virtual address.
The major problem with using virtual addresses to access the cache is that these
names are defined within a process. Two processes may know the same physical
word under different virtual names. Conversely, the same virtual name may
designate different blocks for different processes. If the processor is multiprogrammed,
a context switch must therefore be accompanied by a cache sweep or purge. Otherwise,
the new process may issue a reference with a virtual name which will hit on a block
that had the same name for the previously running process. This problem can be
avoided by augmenting the virtual address with an address-space identification,
which makes it unique.
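One common form of this augmentation stores an address-space identifier (ASID) alongside each cache tag, so that a hit requires both fields to match; the sketch below is illustrative, and its names are hypothetical.

class VCacheEntry:
    """A directory entry of a virtually addressed cache whose tag is
    extended with an address-space identifier (ASID)."""
    def __init__(self, asid, vtag):
        self.valid, self.asid, self.vtag = True, asid, vtag

def is_hit(entry, asid, vtag):
    # Both the virtual tag and the ASID must match, so two processes using
    # the same virtual name cannot hit on each other's blocks.
    return entry.valid and entry.asid == asid and entry.vtag == vtag

entry = VCacheEntry(asid=7, vtag=0x3A)
print(is_hit(entry, 7, 0x3A), is_hit(entry, 9, 0x3A))   # True False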
However, there is still a possibility that several copies of the same physical
block may exist in the cache under different names. This is called the synonym
problem and causes coherence problems within the cache. If a block is shared among
several active processes, several of its copies can be present in the cache under
different names. The solution is to avoid multiple copies of the same block in the
same cache by detecting the synonym when it occurs and enforcing consistency.
Synonyms can be detected by mapping the virtual address into a physical address
via a TLB and determining if there exist other virtual addresses in the cache that
have the same physical address. This can be accomplished by a mapping device
that is inverse to the TLB and is called an inverse translation buffer (ITB). The
ITB is accessed with a physical address and indicates all the virtual addresses
in the cache that are associated with that physical address.
To reference memory, the virtual address is applied to the TLB at the same
time as the cache. If a miss occurs in the cache, the physical address obtained from
the TLB is used to request a fetch of the block from main memory. Simultaneously,
the physical address is also used to search the ITB to determine if that block is
already in the cache under a different name (virtual address). If it is, the virtual
address is renamed and moved to its new location to avoid multiple copies of the
same block for consistency reasons. Also, the block-fetch request to memory is
discarded. Otherwise, the block fetched from memory is used and the ITB and
cache updated accordingly. The addressing of the cache by virtual addresses may
decrease the cache access time on a hit, at the cost of increased hardware complexity.
Partitioned cache Another issue in the design of a cache is the partitioning of the
cache into several independent caches in order to segregate various types of
references. Usually, segregation is limited to reference types that are hardware-
detectable: for example, instructions versus data, or references in user mode
versus references in supervisor mode. It could be extended to compiler-imposed
segregation, in which references could be tagged at compile time. This extension,
however, violates the principle of cache transparency, which simplifies the com-
piler. Splitting the cache generally improves the cache bandwidth and access time.
In a pipelined system, the processor is usually physically partitioned into two
units, the I unit and the E unit. The I unit performs instruction fetch and decode
and forwards the decoded instruction to the E unit, which executes it. In the
execution phase, the E unit may fetch and store operands. By splitting the cache
into data (D) and instruction (I) caches, the I cache (D cache) can be placed next
to the I unit (E unit) to permit simultaneous access and reduce the access time.
While one instruction is being fetched from the I cache, another instruction in the
E unit can be accessing its operands from the D cache.
It is generally known that a significant fraction of misses is due to task switch-
ing for the execution of supervisor tasks. In order to reduce these miss transients,
the cache can be split between a user cache and a system cache. Depending on the
mode of execution, one cache or the other is referenced. Note, however, that the
supervisor cache may still have a high miss ratio because of its large working set.
The most obvious problem with a split-cache organization is the consistency
problem, because two copies of a piece of information may now exist in separate caches.
For example, in a pipelined processor, instructions being modified by the E unit
must be stored in the I cache before they can be fetched; however, the E unit can
only access the D cache. Even if we assume that programs are not self-modifying, a
cache block may contain instructions and data. Presumably, this effect can be
minimized by designing compilers to ensure that instructions and data are in
separate blocks. Another problem with a split cache is the possible inefficient use
of cache memory, since the locality properties of instructions and data are not
homogeneous. The miss ratio may increase as a result of splitting the cache.
However, this depends on the work load. Examples of systems with split cache are
the S-1 and the Amdahl 580.
2.4.3 Fetch and Main Memory Update Policies
As discussed earlier, this policy is used to decide when and what information to
fetch into the cache. There are three basic types of fetch policies which are applied
to caches: demand, anticipatory, and selective fetches. Demand and anticipatory
fetch techniques used for paging systems can be applied to caches. In selective
fetch, some information, such as shared writeable data, may be designated as
unfetchable; further, there may be no fetch-on-write when a miss occurs, as dis-
cussed below.
Prefetching can be successfully used to prefetch the needed blocks ahead of
time so that the cache miss ratio can be reduced. The major factor in determining
the usefulness of prefetching in a cache is the block size. It has been found that a
block size of less than 512 bytes results in useful prefetching. Only the OBL pre-
fetch algorithm is usually considered because of its ease of implementation at
cache speeds, Several possibilities exist for deciding when to initiate a prefetch.
For example, for all i, prefetch block ¢ + 1 if'a reference is made to block i for the
first time.
This technique, termed always prefetch, while good in reducing the miss ratio,
creates more traffic to the cache and main memory. In multiprocessor systems,
this may be detrimental. A refined technique is to prefetch block i+ 1 only ona
miss to block i. Yet another technique is tagged prefetch which, in addition to
prefetching on a miss, also prefetches block i + | if a reference to a previously
prefetched block i is made for the first time. Prefetching has been found to be very
effective in pipelined systems such as the Amdahl 470 V/8, which uses prefetch on
a miss.
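The three prefetch triggers can be contrasted with a small simulation. This is a simplified model, not any machine's actual mechanism: the cache is an LRU list of block numbers, "always prefetch" is approximated by attempting a prefetch on every reference, and the trace is hypothetical.

def simulate(trace, policy, cache_blocks=4):
    """Count misses and prefetch fetches for the always / on-miss / tagged
    prefetch triggers over a block-number trace."""
    cache, tagged, misses, prefetches = [], set(), 0, 0

    def touch(block):
        if block in cache:
            cache.remove(block)
        elif len(cache) >= cache_blocks:
            tagged.discard(cache.pop(0))        # evict the least recently used block
        cache.append(block)

    for b in trace:
        miss = b not in cache
        misses += miss
        first_use_of_prefetched = (not miss) and (b in tagged)
        tagged.discard(b)                       # a demand reference clears the tag
        touch(b)
        if (policy == "always"
                or (policy == "on_miss" and miss)
                or (policy == "tagged" and (miss or first_use_of_prefetched))):
            if b + 1 not in cache:
                prefetches += 1
                touch(b + 1)
                tagged.add(b + 1)               # block b+1 arrived by prefetch
    return misses, prefetches

trace = [0, 1, 2, 3, 0, 1, 4, 5, 6, 7]          # hypothetical block-reference trace
for policy in ("always", "on_miss", "tagged"):
    print(policy, simulate(trace, policy))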
One technique used to reduce the wait time of the processor during the fetching
of a missed block is to forward the requested word directly to the processor first
and then complete the fetching of the block in a wraparound fashion. This tech-
nique is called load-through or read-through.
The time when a word in memory is updated after a write depends on the
write policy. One possibility, write-through (WT), is to update directly the memory
copy of the data word. In this case, the copy of a block in the cache is never different
from its copy in memory. Two variations of write-through are possible. The first
is the write-through-with-write-allocate (WTWA) policy, in which a block is loaded
into the cache on a write-miss. In WTWA, both read and write references con-
tribute to the hit ratio. The second possibility is the write-through-with-no-
write-allocate (WTNWA) policy, in which blocks are loaded into the cache on
WWW.Gitmgurgaon.blogspot.com
114 COMPUTER ARCHITECTURE ANID PARALLEL PROCESSING
read-misses but not on write-misses. In WT, the effectiveness of the cache is limited
by the fact that 5 to 30 percent of all memory references are write operations.
When no buffer is provided at the memory, the processor is blocked during
the write-through. In general, one can consider that the memory address and
data registers form a buffer of size one. If another write-through or a miss occurs
when a previous write has not been completed, the processor is blocked.
In order to estimate the effect of the write policies on the average memory
access time, let w_t be the fraction of writes in the system and assume a read-through
fetch policy. Also, let t_c, t_m, and t_b be the cache cycle time, the memory
cycle time (t_c < t_m), and the block transfer time, respectively. Assuming a WTWA
policy, the average time to complete a reference when no buffer is present is

    t_c + (1 − h)t_b + w_t(t_m − t_c)    (2.27)

Note that this assumes the writes to cache and main memory are performed
simultaneously in the case of a hit, and that the miss ratio is 1 − h. For the WTNWA
policy, the average cycle time is
and it increases performance at low cost by reducing the average time the processor
waits on a cache miss.
To further improve the performance, the write-backs on misses have to be
buffered. In a technique called flagged register write-back (FRWB), the modified
block selected for replacement is first written into a fast register to avoid interfering
with the fetch. The new block is then brought into the freed cache block frame. The
block write-back to the memory is activated later and is completed “in the back-
ground.” In an extension to this policy, the blocks to write back could be buffered,
as the modified words are in WT. The fetching of blocks from memory to cache
on misses is given higher priority, and special hardware checks for the possible
presence of a requested block in the write-back queue. The cost effectiveness of
such an extension depends on the relative improvement achievable beyond
FRWB.
As for WT, we can estimate the efficiency of the write-back strategies for some
special cases. For SWB, the average reference time is

    t_c + 2(1 − h)t_b    (2.30)

For FWB, it is

    t_c + (1 − h)(t_b + w_b t_b) = t_c + (1 − h)(1 + w_b)t_b    (2.31)

where w_b is the probability that a replaced block has been modified.
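Equations (2.27), (2.30), and (2.31) are easy to evaluate for candidate designs. The parameter values below are hypothetical, chosen only to illustrate the comparison.

def wtwa(h, w_t, t_c, t_m, t_b):
    """Eq. (2.27): write-through with write-allocate, no write buffer."""
    return t_c + (1 - h) * t_b + w_t * (t_m - t_c)

def swb(h, t_c, t_b):
    """Eq. (2.30): simple write-back (write the victim block, then fetch)."""
    return t_c + 2 * (1 - h) * t_b

def fwb(h, w_b, t_c, t_b):
    """Eq. (2.31): flagged write-back; only modified victims are written."""
    return t_c + (1 - h) * (1 + w_b) * t_b

# Hypothetical values: 50 ns cache cycle, 500 ns memory cycle, 800 ns block
# transfer, hit ratio 0.95, 20% writes, 60% of replaced blocks modified.
h, w_t, w_b, t_c, t_m, t_b = 0.95, 0.20, 0.60, 50, 500, 800
print(wtwa(h, w_t, t_c, t_m, t_b), swb(h, t_c, t_b), fwb(h, w_b, t_c, t_b))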
The comparison between WT and WB is rather complex and depends on the
program behavior. However, three major factors influence the effectiveness of
WT or WB in a given system: the extent of memory traffic, data consistency, and
reliability. WT usually results in more memory traffic, which can be very detri-
mental to the performance of a system with multiple processors. However, when the
WT is used, main memory is always consistent with the cache, since the memory
always has the updated copy of the data. Thus the failure of a processor and its
cache permits recoverability.
[Figure: shift-register implementation of an LRU stack for a four-block set — pairs of D flip-flops hold the block numbers X, Y, Z, and W; control signals NX, NY, and NZ, gated by the hit clock, shift entries toward the least-recently-used end on each hit.]
In this implementation, pairs of D flip-flops form a hardware stack, each pair holding the number of one
block in the LRU stack. NX is 1 if the block that has just been accessed is not block
number X; otherwise, NX is 0. The values of NY and NZ are obtained similarly.
Whenever a request results in a hit, a hit clock is generated immediately to control
the updating process. Each of these three control signals, together with the hit
clock, determines whether the corresponding block should be shifted to the right in
the LRU stack. The number of the block that has just been accessed is loaded into
the leftmost pair of D flip-flops every time a hit in this set occurs. The contents
of the other pairs are shifted to the right until the previous position of the just-accessed
block is reached. The rightmost pair of D flip-flops always indicates the
number of the least recently used block in the set associated with this LRU stack.
A third implementation uses E(E − 1) active status bits for a set with E
elements. These E(E − 1) active bits are derived from an E-by-E binary matrix in
which the diagonal elements are passive and always zero. When the block in the
jth block frame is referenced, the jth row of the binary matrix is first set to all 1's
and then the jth column is set to all 0's. It is easy to show that, using such a scheme,
the most recently used block is always the block in the block frame that has the
largest number of 1's in its row. Similarly, the least recently used block is in the
block frame with the smallest number of 1's in its row.
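The reference-matrix scheme is easy to express in code; the sketch below, with an illustrative reference sequence, models one set of E block frames.

class MatrixLRU:
    """Reference-matrix LRU for one set of E block frames: on a reference
    to frame j, set row j to all 1s and then clear column j; the LRU frame
    is the one whose row holds the fewest 1s."""
    def __init__(self, e):
        self.e = e
        self.m = [[0] * e for _ in range(e)]

    def reference(self, j):
        for col in range(self.e):
            self.m[j][col] = 1
        for row in range(self.e):
            self.m[row][j] = 0

    def lru_frame(self):
        return min(range(self.e), key=lambda row: sum(self.m[row]))

lru = MatrixLRU(4)
for frame in (0, 1, 2, 3, 1, 0):
    lru.reference(frame)
print(lru.lru_frame())   # frame 2 is the least recently used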
The three implementation schemes discussed above require a number of
status bits that increase with the square of the set size. For a small set size (4 or 6),
it is acceptable. However, for machines with a large set size (8 for IBM 370/168-3;
and 16 for IBM 3033), it may be too expensive and slow. In such systems, the set of
elements is partitioned into nonoverlapping groups. The LRU group is determined
and the LRU element within the group is selected for replacement. If this scheme
were applied to a set size of 8, in which the groups consist of 2 elements each, the
implementation would use 20 active status bits instead of 56.
It has been shown that, in general, the effect of cache replacement algorithms
on the performance of the cache is secondary when compared to the effect of the
mapping on performance. The fully associative cache is most sensitive to the re-
placement algorithm (and least sensitive to mapping), while the direct-mapping
cache is the most sensitive to mapping (and least sensitive to the replacement al-
gorithm).
[Figure: a typical I/O subsystem — memory modules and I/O devices attached through memory and device interfaces to shared address, data, and control buses.]
There are many different types of peripheral devices. Most of them are electro-
mechanical devices and hence transfer data at a rate often limited by the speed of
the electromechanical components. Table 2.3 shows some typical peripheral
devices. Bubble memories, disks, drums, and tape devices are mass storage devices
which store data cheaply for later retrieval. Typical capacities of mass storage
devices are: fixed-head and moving-head disks, 512M bytes; floppy disks, 1M
bytes; 9-track tape, 46M bytes; and cassette tape, from 64K to 512K bytes.
Display terminals are input-output devices which consist of keyboards and cathode
ray tubes (CRT). The keyboard acts as input while the CRT is the output display.
In some cases where the CRT is replaced by a printer, the terminals are called
teletypes.
Since terminals are often used interactively and are relatively slow devices, a
reliable technique for transmitting characters between the processor and the
terminal is serial data transmission. This method is cheaper than parallel trans-
mission of characters because only one signal path is required. Data communication
over long distances is usually done serially. For this reason, remote communication
can be done over telephone lines by using a modem (modulator-demodulator)
interface. A modem is used at each end of the transmission line. There are a
variety of character codes used in the transmission of data; however, one of the
standard codes often used is the American Standard Code for Information
Interchange (ASCII), which uses seven-bit characters.
Figure 2.32 Program-driven I/O. [Flowchart: the processor repeatedly reads the device status and, when the device is ready, reads the data from the device interface.]
of data from a device to main memory. The CPU initializes a buffer in main memory
which will receive the block of data after the I/O transaction is complete. The
address of the buffer and its size are transmitted to the device controller, and the
address of the required block of data in the device, is also given to the controller.
The CPU then executes a special "start I/O" command which causes the I/O
subsystem to initiate the transfer. While the transfer is in progress, the CPU will
be free to perform basic computations, thereby improving overall system perform-
ance. When the block transfer is complete, the CPU is notified. Notice that since
the CPU and the controller share the main memory, the device will periodically
“steal” memory cycles from the CPU to deposit the data in memory. The cycle-
stealing is very effective since the devices are often slower than the CPU. When
the CPU and the device controller conflict in accessing the bus or a memory
module, the device is given priority over the CPU in the access since it is a more
time-critical component. This type of I/O data transfer scheme is called direct
memory access (DMA). The I/O controller often used for DMA operations is called
an I/O data channel.

[Figure: a device interface — data-bus buffers, transmit and receive data registers, a status register with transmitter-empty and receiver-full flags, parity check logic, and interrupt-request generation to the CPU.]
Notice that the DMA facility does not yield total control of the I/O transaction
to the I/O subsystem. The I/O subsystem can assume complete control of the
I/O transactions if a special unit, called an I/O processor (IOP), is used. The IOP
has direct access to main memory and contains a number of independent data
channels. It can execute I/O programs and can perform several independent I/O
transactions between main memory and devices, or between two devices, without
the intervention of the CPU.
[Figures: a programmable multi-port I/O interface controller with per-port data, direction, control, and status registers, register-select and read/write control, and interrupt-request lines to the CPU; a sender-receiver handshake interface in which an INPUT STROBE (data ready) line latches incoming data while an INPUT LATCH EMPTY (data acknowledge) line and an interrupt flip-flop signal the CPU; and I/O controllers sharing the system bus with memory.]

Figure 2.37 Polled interrupt method.
starting address or interrupt vector of the device. This interrupt vector permits the
CPU to transfer control to the device service routine at the corresponding vector
location in main memory. A system that possesses this capability is said to have
vectored interrupts.
A vectored-interrupt system requires a priority scheme to be provided in the
hardware. This priority scheme could be fixed, rotating, or dynamic priority. When
the CPU accepts an interrupt request, it sends an acknowledgement to the vectored-
interrupt controller. The controller, upon receipt of this acknowledgement, sends
the unique interrupt vector of the highest-priority device among the set of unmasked
interrupting devices. This action is illustrated in Figure 2.38. The interrupt-acknowledge
signal can in turn be transmitted to the highest-priority device
controller that caused the interrupt, in order to reset the interrupt request from
that device.
Channel architecture There are basically two types of channels: selector and
multiplexor, as used in the IBM 370 systems. A selector channel is an IOP designed
to handle one I/O transaction at a time. Once the device is selected, the set of I/O
operations for a given transaction runs to completion before the next transaction
is initiated. The selector channel is thus normally used to control high-speed I/O
devices such as fixed-head disks and drums.

Figure 2.38 Vectored interrupt system. [Interrupt-request lines from the device controllers are latched, masked by a mask register set by the CPU, and fed to a priority encoder that generates the interrupt vector; the interrupt acknowledge from the CPU clears the pending request.]

[Figure: a typical I/O subsystem organization — CPUs and memory modules on the system bus, I/O processors each containing several I/O channels, and device controllers (including a multichannel controller switch) serving low-speed and high-speed devices.]

Figure 2.40 shows the architecture
of a typical selector channel. The channel consists of word assembly and dis-
assembly registers (WAR and WDR), which store the current word being received
from the external device and the current word being transferred to the device,
respectively. The channel can be made to receive or transmit data in character,
halfword, or fullword mode. Thus, the assembly-disassembly registers can operate
accordingly. The devices could be operating too fast (in the case of input) for the
channel to handle the data reliably. For example, the next character could arrive
before the current one in the WAR has been transmitted to the CPU. If this occurs,
an overrun error or buffer-full interrupt is generated to the CPU, which can
request retransmission. One way to alleviate this problem is to double-buffer the
input data. This discussion can also apply to the WDR when operating on a slow
output device.
The initialization of the selector channel requires the definition of the location
of the first word in memory, the length of the block to be transferred, and the
device address. The initialization program is stored in memory and can be ex-
ecuted by the channel in order to initialize its internal registers. The registers used
in this case are the device address register (DAR), the block count register (BCR),
[Figure 2.40 Selector channel architecture — channel buffer and control logic between the system data bus and the device, with parity generation and checking, a transfer-complete and status interrupt, and the following registers: BCR, block count register; DAR, device address register; MAR, memory address register; WAR, word assembly register; WDR, word disassembly register.]
and the memory address register (MAR). In order to perform an I/O transaction,
the CPU transmits a START signal and the device address to the channel to which
the selected device is attached. The channel then fetches the channel address word
(CAW) from a prespecified location in memory. This word, which was stored
prior to initiation of the I/O transaction, contains the starting address of the
I/O program (called the channel program) to be executed by the channel.
The channel program consists of channel command words (CCWs), that is, control
words or instructions. In most sophisticated channels, the channel programs may
include commands for positioning the read-write heads of disk drives, rewinding
tapes, and selecting or testing the status of a device. In addition, the set of CCWs
may contain instructions which permit looping and branching. The concept
of the single channel program can be extended so that the CPU prepares an arbitrary
number of I/O transactions to be executed by the I/O subsystem as a sequence
of I/O transactions. This feature is known as command chaining.
If the addressed device is available, the channel executes the channel program
to perform the I/O transaction; otherwise, the request may be queued or the
CPU notified of the unavailability of the device. If the channel program is executed,
the DAR, BCR, and MAR are initialized and the block transfer initiated. Subsequently,
the MAR contains the current memory address and the BCR contains
the remaining block length. After the transfer of a word between the main memory
and the channel, the MAR is incremented and the BCR decremented by one to
reflect the updated values. When the BCR counts down to zero,
a "transfer-complete" interrupt is generated and sent to the CPU. In case of
errors (parity or lost character), an error interrupt is also generated. The typical
maximum data rate of a selector channel is on the order of 1 to 3 megabytes/s.
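The bookkeeping a selector channel performs can be sketched as follows. The CCW format, memory layout, and addresses here are hypothetical and much simpler than the IBM 370 formats; the sketch only shows the CAW fetch, the MAR/BCR updates, and command chaining.

def run_selector_channel(memory, caw, device):
    """Toy selector channel: the CAW holds the channel-program address;
    each CCW here is ('READ', device_addr, mem_addr, count), and command
    chaining simply advances to the next CCW until a None terminator."""
    pc = memory[caw]
    while memory[pc] is not None:
        op, dar, mar, bcr = memory[pc]          # DAR, MAR, BCR loaded from the CCW
        while bcr > 0:                          # word-by-word transfer (cycle stealing)
            memory[mar] = device[dar]
            dar, mar, bcr = dar + 1, mar + 1, bcr - 1
        pc += 1                                 # command chaining: next CCW
    return "transfer-complete interrupt"

# Hypothetical layout: CAW at location 72, channel program at 100, buffer at 5000.
memory = {72: 100, 100: ("READ", 0, 5000, 4), 101: None}
device = list(range(16))
print(run_selector_channel(memory, caw=72, device=device))
print([memory[a] for a in range(5000, 5004)])   # [0, 1, 2, 3]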
A multiplexor channel is an IOP which can control several different I/O
transactions concurrently. In this case, the data transfers are time-multiplexed
over the I/O interface. This type of channel can be further divided into block and
character multiplexors. The character multiplexors are used to handle low-speed
devices, whereas block multiplexors are used for medium- and high-speed devices.
The block or character multiplexor consists of a set of subchannels, each of which
can act as a low-speed selector channel, as shown in Figure 2.41.
Each subchannel contains a buffer, device address register, request flag, and
some control and status flags. However, the subchannels share global channel
control. Each subchannel is required to have a memory address register (to maintain
the current memory address) and a block count register (to maintain the length
of block remaining to be transferred). In a character multiplexor channel with a
large number of subchannels, as in the IBM 370 system with 256 subchannels, it
is cost prohibitive to maintain these pairs of registers in the subchannels. Hence,
these registers are maintained in main memory and are accessed by the channel
control, as shown in Figure 2.41. The channel controller can select a subchannel
for a burst mode or multiplex mode. In the multiplex mode, the scan control
cyclically polls the request flag of each subchannel. If the flag is set, the subchannel
is selected for a character or block transfer. The subchannel mode control is checked
to determine the direction of the transfer operation. When the character or block
is transferred, the next subchannel is polled. The block multiplexor interleaves
by blocks instead of characters as in a character multiplexor.
For example, suppose that three successive I/O transactions X, Y, and Z
are requested, and that each transaction must transfer a string of n characters.
X, Y, and Z are the character sequences X0, X1, ..., X(n-1); Y0, Y1, ..., Y(n-1); and
Z0, Z1, ..., Z(n-1), respectively. If these transactions are initiated on a selector
channel, the transmission appears as X0 X1 ... X(n-1) Y0 Y1 ... Y(n-1) Z0 Z1 ... Z(n-1).
On a character multiplexor with at least three subchannels, it may appear as
X0 Y0 Z0 X1 Y1 Z1 ... X(n-1) Y(n-1) Z(n-1). On a block multiplexor programmed
for k characters per block (assuming that k < n), the sequence may appear as
X0 X1 ... X(k-1) Y0 Y1 ... Y(k-1) Z0 Z1 ... Z(k-1) Xk X(k+1) ... X(2k-1)
Yk Y(k+1) ... Y(2k-1) Zk Z(k+1) ... Z(2k-1), and so on. The frequent
switching and the associated overhead degrade the performance of the character
multiplexor. The maximum data rate for the character multiplexor is typically
on the order of 100K to 200K bytes/s. The maximum data rate of the block
multiplexor channel approaches that of the selector channel as the block size k
approaches the string length n.
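The three orderings can be reproduced with a few lines of Python; the stream contents are illustrative.

def selector(streams):
    """One transaction at a time, run to completion."""
    return [c for s in streams for c in s]

def char_multiplexor(streams):
    """Interleave character by character across the subchannels."""
    return [s[i] for i in range(len(streams[0])) for s in streams]

def block_multiplexor(streams, k):
    """Interleave k-character blocks across the subchannels."""
    n = len(streams[0])
    return [c for i in range(0, n, k) for s in streams for c in s[i:i + k]]

X = [f"X{i}" for i in range(4)]
Y = [f"Y{i}" for i in range(4)]
Z = [f"Z{i}" for i in range(4)]
print(selector([X, Y, Z]))
print(char_multiplexor([X, Y, Z]))
print(block_multiplexor([X, Y, Z], k=2))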
[Figure 2.41 Character or block multiplexor channel — each subchannel has a character buffer, request flag, mode control, and status bits; the subchannels share the channel's global scan control, parity check, and system-bus buffer logic.]
Figure 2.42 The Intel 8089 I/O processor with two separate I/O channels (El-Ayat, IEEE Computer, 1979). [Block diagram: instruction fetch unit, register file, and assembly/disassembly register file.]
Figure 2.43 Register set of a channel in the Intel IOP (El-Ayat, IEEE Computer, 1979). [Three 20-bit general-purpose address registers GA, GB, and GC addressing I/O or memory space, a 16-bit index register IX, and a parameter pointer, among others.]
they may terminate the current DMA transfer. The channel control register (CC)
is a special 16-bit register which defines the channel’s operation during DMA
transfer operations. In addition to the user-programmable registers, there are
two non-user programmable 20-bit registers, also shown in Figure 2.43.
The assembly-disassembly register file is used in the DMA transfer mode.
For example, when data is transferred during a DMA operation from an 8-bit
bus to a 16-bit bus, the IOP assembles 2 bytes in its assembly register file before
transferring a word to the destination. A simplified computational model of the
Intel IOP is given in Figure 2.44. After reset, a channel attention (CA) input
pulse forces an internal initialization sequence. Then the processor is ready to
dispatch an I/O transaction request to either of the two channels to perform the
desired I/O task. The I/O channel normally begins its operation in the task block
(TB) state with the execution of the I/O program, and enters the DMA state under
IOP program control. In this state, the channel proceeds with high-speed data
Figure 2.44 Simplified computational model of the Intel I/O processor (El-Ayat, IEEE Computer, 1979). [After system initialization, a channel dispatcher assigns I/O transactions to channel 1 or channel 2; each channel alternates between task-block (TB) program execution and DMA transfer states under its channel controller.]
[Figure: the CDC peripheral processing units (PPUs) connected through a bus to the central memory controller and processing subsystem, with device controllers (DC) attached to the PPUs.]
Figure 2.46 Barrel processing of I/O transactions in CDC integrated peripheral processing units (Courtesy of Control Data Corp.). [The I/O programs of the PPUs circulate in a barrel of registers; a single time-shared slot provides instruction control and access to central memory (60 bits) and the 12-bit I/O data paths in real time.]
The CDC 6600 integrated peripheral processor uses a so-called barrel design
to share logical units within the IOP. It uses a set of registers to share a common
arithmetic logic unit and a data distribution system in a synchronous fashion.
The barrel contains 10 peripheral processing units (PPUs), and a PPU is 12 bits
wide. A PPU instruction requires a number of steps for its execution. The execution
of each step is performed in a distinct "slot" which logically represents a
PPU. Hence, the PPU instruction is executed as in a cyclic pipeline process, as
shown in Figure 2.46. This execution sequence is possible because each instruction
cycle is an integral number (up to 19) of minor cycles. A minor cycle is 100 ns and
a major cycle is 1000 ns; hence the choice of 10 PPUs.
In each minor cycle, all information in the barrel is moved one position (synchronously)
after each step is executed in its current slot. The information in each
PPU is moved through the shared slot position once every major cycle. Since each
PPU operates once per major cycle, the maximum data rate is 12 bits / 1000 ns =
12 × 10^6 bits/s. Therefore, the 10 PPUs are time-shared by the slot hardware without
significant degradation in performance. However, since the CDC 6600 is a
60-bit computer, five PPU transfers are required to form a 60-bit word. Also,
since the I/O processing is synchronized in the CDC system, no handshaking is
necessary as in the IBM channels.
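The barrel's time-sharing discipline amounts to a synchronous round-robin through a single shared slot; the following sketch, with hypothetical step names, illustrates that each PPU advances one step per major cycle.

def run_barrel(programs, minor_cycles):
    """Each PPU holds a list of instruction steps; in every minor cycle
    exactly one PPU occupies the shared slot and executes its next step,
    so each PPU gets the slot once per major cycle."""
    num_ppus = len(programs)                 # 10 in the CDC 6600
    pcs = [0] * num_ppus
    log = []
    for cycle in range(minor_cycles):
        ppu = cycle % num_ppus               # the barrel rotates one position per minor cycle
        if pcs[ppu] < len(programs[ppu]):
            log.append((cycle, ppu, programs[ppu][pcs[ppu]]))
            pcs[ppu] += 1
    return log

progs = [[f"ppu{p}-step{s}" for s in range(2)] for p in range(10)]
for entry in run_barrel(progs, 20):
    print(entry)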
I/O configuration in systems with cache There are two basic methods of connecting
an I/O subsystem to the processor-memory complex in a system with a cache.
In the first configuration, the I/O channel is attached to the cache so that the
cache is shared by the processor and the channel, as shown in Figure 2.47a. The
channel competes with the processor for access to the cache. An I/O channel is
often slower than the processor, so connecting the channel to the cache does
not significantly improve the performance of I/O transfers. I/O transfers have
little locality, and they increase the traffic between the cache and memory. This
increase is caused by three main effects: main memory updates of memory-bound
I/O data; misses caused by channel fetches from memory; and channel programs
(and I/O data) occupying the cache, which increases the aggregate miss ratio
seen by processor-bound jobs. The configuration of Figure 2.47a may also
encounter cache data overrun, in which the data transfer occurs at a rate higher than
the cache controller can sustain.
An alternate configuration is to connect the channel to the memory directly,
as shown in Figure 2.47b. In this case, the channel competes with the cache controller
for access to the memory. However, the I/O channel and processor executions
conflict at miss times only, assuming a write-back memory update policy.
Also, the cache is not encumbered with the data blocks destined for I/O. This
configuration has, however, one major drawback: data consistency or coherence problems. To
illustrate, consider a cache which uses a write-back main memory update policy.
Assume that the processor has modified a copy of a data element X in the cache
so that the value of the copy in the cache is NEWX and the memory has not been
updated.
Let OLDX be the value of X in memory. Before the memory is updated, the
I/O channel requests a fetch from location X in the memory, which delivers
OLDX instead of NEWX: a coherence problem has occurred. One solution is to
keep a dynamic table in the memory controller which, at any time, indicates the
set of blocks in the cache and their status (whether modified or unmodified). Let
the modified status be denoted by RW. When the I/O channel makes a reference
to a memory block which is also in the cache, the status is checked by the memory
controller. If it is RW and the channel requests a read, the data is fetched from the
cache. However, if the channel requests a write to the block, the corresponding
cache block frame is invalidated before the memory block is modified by the channel.
A similar description can be given for a processor reference.

[Figure 2.47 Two ways of attaching an I/O channel in a system with cache: (a) the channel shares the cache with the CPU through the storage control; (b) the channel is connected directly to memory through the memory control and arbitration logic.]
Note that for a system with a buffered write-through update policy, the coherence
problem is automatically corrected in the second configuration if the write queue
is maintained within the memory controller. However, this configuration may
also encounter the data-overrun problem. More cache coherence studies will be
given in Chapter 8.
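The memory-controller check described above can be sketched as follows; the table, block size (the running example's 16 words), and names are illustrative.

def channel_read(addr, memory, cache, status):
    """Channel read: if the block holding addr is cached and marked RW
    (modified), supply the cache copy instead of the stale memory copy."""
    block = addr // 16
    if status.get(block) == "RW":
        return cache[addr]              # deliver NEWX from the cache
    return memory[addr]                 # otherwise memory is up to date

def channel_write(addr, value, memory, cache, status):
    """Channel write: invalidate any cached copy of the block before the
    memory block is modified by the channel."""
    block = addr // 16
    if block in status:
        status.pop(block)               # invalidate the cache block frame
        cache.pop(addr, None)
    memory[addr] = value

memory, cache, status = {100: "OLDX"}, {100: "NEWX"}, {100 // 16: "RW"}
print(channel_read(100, memory, cache, status))   # NEWX, not the stale OLDX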
Problems
2.1 Consider a two-level memory hierarchy (M1, M2) for a computer system, as depicted in the
following diagram. Let C1 and C2 be the costs per bit, S1 and S2 be the storage capacities, and t1 and
t2 be the access times of the memories M1 and M2, respectively. The hit ratio H is defined as the
probability that a logical address generated by the CPU refers to information stored in M1. Answer the
following questions associated with this virtual memory system.
(a) What is the average cost C per bit of the entire memory hierarchy?
(b) Under what condition will the average cost per bit C approach C1?
(c) What is the average access time ta for the CPU to access a word from the memory system?
(d) Let r = t2/t1 be the speed ratio of the two memories, and let E = t1/ta be the access efficiency
of the virtual memory system. Express E in terms of r and H. Also plot E against H for r = 1, 2, 10, and
100, respectively, on grid-graph paper.
(e) Suppose that r = 100. What is the required minimum value of the hit ratio H to make E > 0.90?
2.2 A page trace is a sequence of page numbers P = r1, r2, ..., rk, ..., where rk is the
page number of the kth address in a sequence of addresses.
The fault rate F is the number of page faults divided by the number of page addresses (the length of the
page trace). For this example, F = 5/8 = 0.625. The hit ratio h is 1 − F. For the remainder of this
problem, let P = ebucabdbacd.
(a) Produce a table similar to the above table for page trace P under a FIFO replacement algorithm
with memory size |Mp| = 2 page frames. What is the hit ratio?
(b) Do the same for an LRU replacement algorithm.
(c) Repeat (a) and (b) for |Mp| = 3 page frames.
(d) Intuitively, both FIFO and LRU would seem to be "good" algorithms, while a most recently used
(MRU) algorithm intuitively sounds like a bad algorithm. Repeat (a) with an MRU replacement
algorithm and compare the result with those obtained in (a) through (c). What does this say about the
particular page trace P and about the generality of results obtained by comparing replacement algorithms
based on a single page trace?
2.3 In a uniprocessor with cache, the processor issues its memory access requests to the cache controller (CC). In the case of a miss or a write-through, the CC interacts with the memory controller (MC). Draw the flowcharts describing the operations of a CC for a read and a write operation. Consider the write-back-write-allocate with flagged swap and the write-through-write-allocate strategies. Assume that no read-through is implemented. Indicate how to modify the flowcharts for (a) a write-back-write-allocate with simple WB and with flagged register WB and for (b) write-through without write-allocation.
2.4 Consider the following search program:

begin
  ifound ← N + 1; i ← 0;
  while (ifound ≠ i) do
    begin
      i ← i + 1;
      if (template = data[i]) then ifound ← i;
    end
end
In this program, data[*] is an array of N (= 2^n) floating-point numbers; template is a floating-point number; N, i, and ifound are integers. A floating-point number occupies two memory words, while an integer occupies one memory word only. Assume that the program code as well as the variables N, i, ifound, and template fit on the same memory page. Data[*] is stored in a set of consecutive pages, starting at the beginning of a page. A page is P = 2^p words long. The memory is M = 2^m words long (M < N). Assume that there is one and only one element equal to template in the array data. The algorithm is run on a uniprocessor with a paged virtual memory system. The replacement policy is LRU.
(a) If Probability[ifound = i] = 1/N (1 ≤ i ≤ N), determine the mean number of page faults in the cases where the memory does not contain any of the process pages at the beginning of the process, and where the memory is preloaded to capacity with the program page and the first 2^(m−p) − 1 data pages of the process.
(b) Repeat (a) if Probability[ifound = i] = G(N)·q·(1 − q)^(i−1) for 1 ≤ i ≤ N, 0 < q < 1, where G(N) = 1/[1 − (1 − q)^N].
2.5 A computer architect is considering the adoption of a write-through-with-write-allocate (WTWA) or a write-back (WB) cache management strategy. Assuming no read-through, each block consists of b words, which can be transferred between main memory (MM) and cache in b + c − 1 time units, where c is the MM cycle time. The cache has a hit ratio indicated by the parameter h. The probability that a memory reference is a write is w1, and the probability that the block being replaced in the cache was modified (in the WB strategy) is w2. Usually w2 > w1.
(a) Using each strategy, give a formula for the expected time to process a reference in terms of the above variables.
(b) Assuming w1 = 0.16 and w2 = 0.56, what is the performance of the WB strategy in comparison to the WTWA strategy when (1) h → 1 and (2) h → 0?
(c) Give a general expression describing when WTWA is better than WB as a function of h and b. Assume that w1 = 0.16, w2 = 0.56, and c = 10.
(d) Does w2 depend on h? Give intuitive reasons.
2.6 A certain uniprocessor computer system has a paged segmentation virtual memory system and also a cache. The virtual address is a triple (s, p, d), where s is the segment number, p is the page within s, and d is the displacement within p. A translation lookaside buffer (TLB) is used to perform the address translation when the virtual address is in the TLB. If there is a miss in the TLB, the translation is performed by accessing the segment table and then a page table, either or both of which may be in the cache or in main memory (MM).
Address translation via the TLB requires one clock cycle. A fetch from the cache requires two clock cycles (one clock cycle to determine if the requested address is in the cache plus one clock cycle to read the data). A read from MM requires eight clock cycles. There is no overlap between TLB translation and cache access. Once the address translation is complete, the read of the desired data may be from either the cache or MM. This means that the fastest possible data access requires three clock cycles: one for TLB address translation and two to read the data from the cache. There are nine other ways in which a read can proceed, all requiring more than three clock cycles.
(a) Assuming a TLB hit ratio of 0.9 and a cache hit ratio of h, enumerate all 10 possible read patterns, the time taken for each, and the probability of occurrence for each pattern. What is the average read time in the system? (Assume that when a word is fetched from memory, a read-through policy is used.)
(b) The above discussion assumes that the cache is always given a physical memory address. Suppose that the cache is presented with the virtual address of the data being requested rather than its physical address in memory. In this case, the TLB translation and cache search can be done concurrently. This means that whenever the requested data is in the cache, no address translation is necessary and only two clock cycles are required for the fetch. If the data is not in the cache, either a TLB translation or a segment table-page table access is needed to generate the physical address of the data. When data is written into the cache, it is tagged with its virtual address. Find the average read time for a system organized in this fashion. Assume that only one clock cycle is required to establish that an item is not in the cache.
(c) What are the disadvantages of a cache using virtual addresses?
2.7 In the LRU stack model, assume that the stack distances are independently and identically drawn from a distribution q(j), j = 1, 2, ..., n, for a stack of size n. Since each set in the cache constitutes a separate associative memory, it can be managed with LRU replacement. Show that the probability p(i, S) of referencing the ith most recently referenced block in a set, given S sets, is [expression not reproduced in this copy].
2.8 Consider three interleaved memory organizations for a main memory system containing 8 memory modules, M0, M1, ..., M7. Each module has a capacity of 2K words. In total, the memory capacity is 16K words. The maximum memory bandwidth is 8 words/cycle. In each of the following organizations, first specify the memory address format (14 bits), then show the address assignment patterns in each memory module, and finally indicate the maximum bandwidth when one of the 8 modules fails to function. Comment on the relative merits of the three interleaved memory organizations.
(a) Eight-way interleaved memory organization (one group).
(b) Grouped four-way interleaved organization (two groups).
(c) Grouped two-way interleaved organization (four groups).
CHAPTER
THREE
PRINCIPLES OF PIPELINING AND
VECTOR PROCESSING
types of pipeline processors are then classified according to pipelining levels and
functional configurations. Finally, we introduce the reservation table as a design
tool of general pipelines with either linear or nonlinear data-flow patterns.
Assembly lines have been widely used in automated industrial plants in order to
increase productivity. Their original form is a flow line (pipeline) of assembly
stations where items are assembled continuously from separate parts along a
moving conveyor belt. Ideally, all the assembly stations should have equal pro-
cessing speed. Otherwise, the slowest station becomes the bottleneck of the entire
pipe. This bottleneck problem plus the congestion caused by improper buffering
may result in many idle stations waiting for new parts. The subdivision of the input
tasks into a proper sequence of subtasks becomes a crucial factor in determining
the performance of the pipeline.
In a uniform-delay pipeline, all tasks have equal processing time in all station
facilities. The stations in an ideal assembly line can operate synchronously with
full resource utilization. However, in reality, the successive stations have unequal
delays. The optimal partition of the assembly line depends on a number of factors,
including the quality (efficiency and capability) of the working units, the desired
processing speed, and the cost effectiveness of the entire assembly line.
The precedence relation of a set of subtasks {T1, T2, ..., Tk} for a given task T implies that a subtask Tj cannot start until some earlier subtask Ti (i < j) finishes. The interdependencies of all subtasks form the precedence graph. With a linear precedence relation, task Tj cannot start until all earlier subtasks {Ti, for all i < j} finish. A linear pipeline can process a succession of subtasks with a linear precedence graph.
A basic linear-pipeline processor is depicted in Figure 3.1a. The pipeline con-
sists of a cascade of processing stages. The stages are pure combinational circuits
performing arithmetic or logic operations over the data stream flowing through
the pipe. The stages are separated by high-speed interface latches. The latches are
fast registers for holding the intermediate results between the stages. Information
flows between adjacent stages are under the control of a common clock applied
to all the latches simultaneously.
Clock period The logic circuitry in each stage Si has a time delay denoted by τi. Let τl be the time delay of each interface latch. The clock period of a linear pipeline is defined by

    τ = max{τi} + τl    (3.1)

The reciprocal of the clock period is called the frequency f = 1/τ of a pipeline processor.
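As a quick numerical illustration of Eq. 3.1 and the frequency definition (a sketch only, reusing the stage delays of the floating-point adder example given later in this section):

stage_delays = [60e-9, 50e-9, 90e-9, 80e-9]   # stage delays in seconds
latch_delay = 10e-9                           # interface latch delay

tau = max(stage_delays) + latch_delay         # Eq. 3.1: 100 ns
f = 1.0 / tau                                 # frequency: 10 MHz
print(tau, f)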
[Figure 3.1: (a) a basic linear pipeline processor (L: latch; C: clock; Si: the ith stage); (b) the space-time diagram of a four-stage pipeline, in which an entry denotes the jth subtask of the ith task.]
One can draw a space-time diagram to illustrate the overlapped operations in a linear pipeline processor. The space-time diagram of a four-stage pipeline processor is demonstrated in Figure 3.1b. Once the pipe is filled up, it will output one result per clock period, independent of the number of stages in the pipe. Ideally, a linear pipeline with k stages can process n tasks in Tk = k + (n − 1) clock periods, where k cycles are used to fill up the pipeline or to complete execution of the first task and n − 1 cycles are needed to complete the remaining n − 1 tasks. The same number of tasks (operand pairs) can be executed in a nonpipeline processor with an equivalent function in T1 = n·k clock periods.
The speedup of a k-stage linear pipeline over an equivalent nonpipeline processor is defined as

    Sk = T1/Tk = n·k·τ / [k·τ + (n − 1)·τ] = n·k / (k + n − 1)    (3.2)

It should be noted that the maximum speedup is Sk → k for n ≫ k. In other words, the maximum speedup that a linear pipeline can provide is k, where k is the number of stages in the pipe. This maximum speedup is never fully achievable because of data dependencies between instructions, interrupts, program branches, and other factors to be revealed in later sections. Many pipeline cycles may be wasted on a waiting state caused by out-of-sequence instruction executions.
To understand the operational principles of pipeline computation, we illus-
trate the design of a pipeline floating-point adder in Figure 3.2. This pipeline is
linearly constructed with four functional stages. The inputs to this pipeline are
two normalized floating-point numbers:

    A = a × 2^p  and  B = b × 2^q    (3.3)

where a and b are two fractions and p and q are their exponents, respectively. For simplicity, base 2 is assumed. Our purpose is to compute the sum

    C = A + B = c × 2^r = d × 2^s    (3.4)

where r = max(p, q) and 0.5 ≤ d < 1. Operations performed in the four pipeline stages are specified below:
1. Compare the two exponents p and q to reveal the larger exponent r = max(p, q) and to determine their difference t = |p − q|.
2. Shift right the fraction associated with the smaller exponent by t bits to equalize the two exponents before fraction addition.
3. Add the preshifted fraction with the other fraction to produce the intermediate sum fraction c, where 0 ≤ c < 2.
4. Count the number of leading zeros, say u, in fraction c and shift left c by u bits to produce the normalized fraction sum d = c × 2^u, with a leading bit 1. Update the larger exponent by computing s = r − u to produce the output exponent.
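A minimal Python sketch of the four-stage algorithm is given below; it works on real-valued fractions rather than bit-level shifters, and the function name and the treatment of fraction overflow in stage 4 are assumptions of this sketch, not part of the original design.

def fp_add(a, p, b, q):
    # Stage S1: compare exponents -- larger exponent r and difference t
    r, t = max(p, q), abs(p - q)
    # Stage S2: pre-shift the fraction with the smaller exponent right by t bits
    if p < q:
        a /= 2 ** t
    else:
        b /= 2 ** t
    # Stage S3: add the aligned fractions; intermediate sum c, 0 <= c < 2
    c = a + b
    # Stage S4: post-normalize -- shift by the leading-zero count u (u becomes
    # negative if c >= 1, an assumed overflow fix-up) and update s = r - u
    d, u = c, 0
    while d >= 1.0:
        d /= 2.0
        u -= 1
    while 0.0 < d < 0.5:
        d *= 2.0
        u += 1
    return d, r - u     # C = d * 2**(r - u)

# fp_add(0.5, 2, 0.75, 1) returns (0.875, 2), i.e. 2.0 + 1.5 = 3.5 = 0.875 * 2**2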
The comparator, selector, shifters, adders, and counter in this pipeline can all be implemented with combinational logic circuits. Detailed logic design of these boxes can be found in the book by Hwang (1979). Suppose the time delays of the four stages are τ1 = 60 ns, τ2 = 50 ns, τ3 = 90 ns, and τ4 = 80 ns, and the interface latch has a delay of τl = 10 ns. The cycle time of this pipeline can be chosen to be at least τ = 90 + 10 = 100 ns (Eq. 3.1). This means that the clock frequency of the pipeline can be set to f = 1/τ = 1/(100 ns) = 10 MHz. If one uses a nonpipeline floating-point adder, the total time delay will be τ1 + τ2 + τ3 + τ4 = 60 + 50 + 90 + 80 = 280 ns.
Figure 3.2 A pipelined floating-point adder with four processing stages.
[Figure 3.3: a pipelined CPU — a multiway interleaved main memory feeds the instruction unit (update PC and check interrupt, instruction fetch, instruction decode, operand address calculation, operand fetch), which fills a FIFO instruction queue of decoded instructions ready for execution; the execution unit (E unit) contains the arithmetic logic pipelines.]
In this case, the pipeline adder has a speedup of 280/100 = 2.8 over the nonpipeline adder design. If uniform delays can be achieved in all four stages, say 70 ns per stage (including the latch delay), then the maximum speedup of 280/70 = 4 can be achieved.
The central processing unit (CPU) of a modern digital computer can generally be partitioned into three sections: the instruction unit, the instruction queue, and the execution unit. From the operational point of view, all three units are pipelined, as illustrated in Figure 3.3. Programs and data reside in the main memory, which usually consists of interleaved memory modules. The cache is a faster storage of copies of programs and data which are ready for execution. The cache is used to close up the speed gap between main memory and the CPU.
The instruction unit consists of pipeline stages for instruction fetch, instruction decode, operand address calculation, and operand fetches (if needed). The instruction queue is a first-in, first-out (FIFO) storage area for decoded instructions and fetched operands. The execution unit may contain multiple functional pipelines for arithmetic logic functions. While the instruction unit is fetching instruction I + K + 1, the instruction queue holds instructions I + 1, I + 2, ..., I + K, and the execution unit executes instruction I. In this sense, the CPU is a good example of a linear pipeline. We will describe the detailed design of a pipeline CPU for instruction execution and arithmetic computations in Section 3.3.
After defining the clock period and speedup in Eqs. 3.1 and 3.2, we need to introduce two related measures of the performance of a linear pipeline processor. The product (area) of a time interval and a stage space in the space-time diagram (Figure 3.1b) is called a time-space span. A given time-space span can be in either a busy state or an idle state, but not both. We use this concept to measure the performance of a pipeline.
Efficiency The efficiency of a linear pipeline is measured by the percentage of busy time-space spans over the total time-space span, which equals the sum of all busy and idle time-space spans. Let n, k, and τ be the number of tasks (instructions), the number of pipeline stages, and the clock period of a linear pipeline, respectively. The pipeline efficiency is defined by

    η = n·k·τ / {k·[k·τ + (n − 1)·τ]} = n / (k + n − 1)    (3.5)

Note that η → 1 as n → ∞. This implies that the larger the number of tasks flowing through the pipeline, the better is its efficiency. Moreover, we realize that η = Sk/k from Eqs. 3.2 and 3.5. This provides another view of the efficiency of a linear pipeline as the ratio of its actual speedup to the ideal speedup k. In the steady state of a pipeline, where n ≫ k, the efficiency η approaches 1. However, this ideal case may not hold all the time because of program branches and interrupts, data dependency, and other reasons to be discussed in Section 3.2.

Throughput The number of results (tasks) that can be completed by a pipeline per unit time is called its throughput:

    w = n / [k·τ + (n − 1)·τ] = η/τ    (3.6)

where n equals the total number of tasks being processed during an observation period k·τ + (n − 1)·τ. In the ideal case, w = 1/τ = f when n ≫ 1. This means that the maximum throughput of a linear pipeline is equal to its frequency, which corresponds to one output result per clock period. We will further evaluate the performance of pipeline processors in Section 4.4.4.
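The three measures can be checked against one another with a few lines of Python (a sketch; the function name and example values are arbitrary):

def pipeline_metrics(n, k, tau):
    # Speedup (Eq. 3.2), efficiency (Eq. 3.5), and throughput (Eq. 3.6)
    # for n tasks on a k-stage linear pipeline with clock period tau.
    Tk = (k + n - 1) * tau       # pipelined completion time
    T1 = n * k * tau             # equivalent nonpipelined time
    Sk = T1 / Tk                 # speedup, approaches k for large n
    eta = Sk / k                 # efficiency, equals n / (k + n - 1)
    w = n / Tk                   # throughput, equals eta / tau
    return Sk, eta, w

print(pipeline_metrics(64, 4, 100e-9))   # about (3.82, 0.955, 9.55e6 results/s)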
up to 26 stages per pipe in the Cyber-205. These arithmetic logic pipeline designs will be studied subsequently.
Processor pipelining This refers to the pipeline processing of the same data stream
by a cascade of processors (Figure 3.4c), each of which processes a specific task.
The data stream passes the first processor with results stored in a memory block
which is also accessible by the second processor. The second processor then
passes the refined results to the third, and so on. The pipelining of multiple
processors is not yet well accepted as a common practice.
Unifunction vs. multifunction pipelines A pipeline unit with a fixed and dedicated function, such as the floating-point adder in Figure 3.2, is called unifunctional. The Cray-1 has 12 unifunctional pipeline units for various scalar, vector, fixed-point, and floating-point operations. A multifunction pipe may perform different functions, either at different times or at the same time, by interconnecting different subsets of stages in the pipeline. The TI-ASC has four multifunction pipeline processors, each of which is reconfigurable for a variety of arithmetic logic operations at different times.
Static vs. dynamic pipelines A static pipeline may assume only one functional configuration at a time. Static pipelines can be either unifunctional or multifunctional. Pipelining is made possible in static pipes only if instructions of the same type are to be executed continuously. The function performed by a static pipeline should not change frequently. Otherwise, its performance may be very low. A dynamic pipeline processor permits several functional configurations to exist simultaneously. In this sense, a dynamic pipeline must be multifunctional. On the other hand, a unifunctional pipe must be static. The dynamic configuration needs much more elaborate control and sequencing mechanisms than those for static pipelines. Most existing computers are equipped with static pipes, either unifunctional or multifunctional.
Scalar vs. vector pipelines Depending on the instruction or data types, pipeline processors can also be classified as scalar pipelines and vector pipelines. A scalar pipeline processes a sequence of scalar operands under the control of a DO loop. Instructions in a small DO loop are often prefetched into the instruction buffer. The required scalar operands for repeated scalar instructions are moved into a data cache in order to continuously supply the pipeline with operands. The IBM
[Figure 3.5: (a) a sample pipeline with feedback connections, a multiplexer, and two outputs (A and B); (b) the reservation table for function A; (c) the reservation table for function B.]
called the evaluation time for the given function. A reservation table represents the flow of data through the pipeline for one complete evaluation of a given function. A marked entry in the (i, j)th square of the table indicates that stage Si will be used j time units after the initiation of the function evaluation. For a unifunctional pipeline, one can simply use an "x" to mark the table entries. For a multifunctional pipeline, different marks are used for different functions, such as the A's and B's in the two reservation tables for the sample pipeline. Different functions may have different evaluation times, such as the 8 and 7 shown in Figures 3.5b and 3.5c for functions A and B, respectively.
The data-flow pattern in a static, unifunctional pipeline can be fully described
by one reservation table. A multifunctional pipeline may use different reservation
tables for different functions to be performed. On the other hand, a given reserva-
tion table does not uniquely correspond to one particular hardware pipeline. One
may find that several hardware pipelines with different interconnection structures
can use the same reservation table.
Many interesting pipeline-utilization features can be revealed by the reservation table. It is possible to have multiple marks in a row or in a column. Multiple marks in a column correspond to the simultaneous usage of multiple pipeline stages. Multiple marks in a row correspond to the repeated usage (for marks in distant columns) or prolonged usage (for marks in adjacent columns) of a given stage. It is clear that a general pipeline may have multiple paths, parallel usage of multiple stages, and nonlinear flow of data.
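One simple use of a reservation table, anticipating the collision analysis of Section 3.3, is to read off the latencies at which a second function evaluation would collide with the first in some stage. The sketch below uses a small hypothetical table (not the one in Figure 3.5) represented as a set of marked (stage, time) entries:

table_A = {
    (1, 0), (1, 3),    # stage S1 is used at times 0 and 3
    (2, 1),            # stage S2 is used at time 1
    (3, 2), (3, 4),    # stage S3 is used at times 2 and 4
}

def forbidden_latencies(table):
    # Latencies d such that a second initiation d cycles after the first
    # would claim some stage at the same time as the first evaluation.
    forbidden = set()
    for (s1, t1) in table:
        for (s2, t2) in table:
            if s1 == s2 and t1 != t2:
                forbidden.add(abs(t1 - t2))
    return forbidden

print(sorted(forbidden_latencies(table_A)))   # [2, 3] for this example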
In order to visualize the flow of data along selected data paths in a hardware pipeline for a complete function evaluation, we show in Figure 3.6 the snapshots of eight steps needed to evaluate function A in the sample pipeline. These snapshots are traced along the entries in reservation table A. Active stages in each time unit are shaded. The darkened connections are the data paths selected in case of multiple path choices. We will use reservation tables in subsequent sections to study various pipeline design problems.
Figure 3.6 Eight snapshots of using the sample pipeline for evaluating the A function in Figure 3.5a.
The S-access memory organization One of the simplest memory configurations for pipeline vector processors uses low-order interleaving and applies the higher (n − m) bits of the address to all M = 2^m memory modules simultaneously in one access. The single access returns M consecutive words of information from the M memory modules. Using the m low-order address bits, the information from a particular module can be accessed. This configuration, which is shown in Figure 3.8a, is called S access because all modules are accessed simultaneously.
[Figure 3.8: (a) the S-access interleaved memory organization — M modules with data latches and a multiplexer/selector driven by the m low-order address bits, the n − m high-order address bits applied to all modules, and read-write control; (b) timing of successive accesses, each returning words 0 through M − 1.]
A data latch is associated with each module. For a fetch operation, the information from each module is gated into its latch, whereupon the multiplexer can be used to direct the desired data to the single-word bus. Figure 3.8b depicts the timing diagram for sample multiple-word read accesses using the S-access configuration. Notice that with a memory-access time Ta and a latch delay of τ, the time to access a single word is Ta + τ. However, the total time it takes to access k consecutive words in sequence, starting in module i, is Ta + kτ if i + k ≤ M; otherwise it is 2Ta + (i + k − M)τ. In both cases, 1 ≤ k ≤ M. For effective access of long vectors, Mτ < Ta; otherwise, there would be a data overrun. The S-access configuration is ideal for accessing a vector of data elements or for prefetching sequential instructions for a pipeline processor. It can also be used to access a block of information for a pipeline processor with a cache.
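The access-time expressions above translate directly into code; the following sketch (with hypothetical parameter values) may help in working through the two cases:

def s_access_time(i, k, M, Ta, tau):
    # Time to read k consecutive words starting in module i under S access,
    # where Ta is the module access time and tau the latch/multiplexer delay.
    assert 1 <= k <= M
    if i + k <= M:                      # all k words come from one access
        return Ta + k * tau
    return 2 * Ta + (i + k - M) * tau   # the run spills into a second access

print(s_access_time(0, 8, 8, 400, 50))   # 800 ns for a full 8-word sweep
print(s_access_time(6, 4, 8, 400, 50))   # 900 ns when the run spans two accesses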
When nonsequentially addressed words are requested, the performance of the memory system deteriorates rapidly. To provide a partial remedy for nonsequential accesses, some concurrency can be introduced into the configuration by providing an address latch for each memory module, so that the effective address cycle (hold time) is much smaller than the memory cycle time. Since the address is typically held on the address bus at least as long as data is held on the data bus, the data buses do not pose a limiting constraint on the performance. By providing the address latch, the group of M modules can be multiplexed on an internal memory address bus, called a bank or a line, as will be studied in Chapter 7.
[Figure: staggered timing of accesses 1, 2, 3, ..., M, M + 1, M + 2, ... distributed across the M modules through an address decoder.]
prime, the elements can be accessed at the maximum rate of one word every Ta/M. It is obvious that the S-access configuration will perform worse for such address sequences. In the S-access scheme, an address sequence generated with a skip distance of d has an average rate of one word every dTa/M when d ≤ M, and one word every Ta when d > M.
The storage scheme for a vector can be extended to two- and higher-dimensional arrays. As an example, consider a two-dimensional array A[0:R − 1, 0:C − 1]. The elements can be mapped into a one-dimensional vector V[0:RC − 1] in two basic ways: row-major form or column-major form. In row-major form, the index of element A[i, j] in the vector V is given by iC + j. Similarly, the index of A[i, j] in column-major form is jR + i. The storage scheme for the two-dimensional array can then be derived from the storage schemes for V, as described earlier.
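The two index mappings can be stated in a couple of lines (an illustrative sketch; the function names are not from the text):

def row_major_index(i, j, R, C):
    return i * C + j        # index of A[i, j] when rows are stored consecutively

def column_major_index(i, j, R, C):
    return j * R + i        # index of A[i, j] when columns are stored consecutively

# For a 3 x 4 array, A[2, 1] maps to 2*4 + 1 = 9 (row-major) or 1*3 + 2 = 5 (column-major).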
[Figure: module number versus time for vector-element accesses under two skip distances — (a) skip distance d = 2; (b) skip distance d = 3, outputting V[i] for i = 0, 3, 6, 9, ..., 45.]
Before studying various pipeline design techniques and examining vector processing requirements, we need to understand how instructions can be overlapped and executed, and how repeated arithmetic computations can be done with pipelining. Instruction pipelining is illustrated with the designs in the IBM 360/91. Arithmetic pipelining will be studied in detail with four design examples for multiple-number addition, floating-point addition, multiplication, and division. Finally, multifunction-pipeline designs and array pipelining for matrix arithmetic will be introduced.
Figure 3.11 The central processing unit (CPU) of the IBM System/360 Model 91.
A block diagram of the CPU in the IBM 360/91 is depicted in Figure 3.11. It consists of four major parts: the main storage control unit, the instruction unit, the fixed-point execution unit, and the floating-point execution unit. The instruction unit (I unit) is pipelined with a clock period of 60 ns. This CPU is designed to issue instructions at a burst rate of one instruction per clock cycle, and the performance of the two execution units (E units) should support this rate. The storage control unit supervises information exchange between the CPU and the main memory. Major functions of the I unit include instruction fetch, decode, and delivery to the appropriate E unit, operand address calculation, and operand fetch. The two E units are responsible for the fixed-point and floating-point arithmetic logic operations needed in the execution phase.
From memory access to instruction decode and execution, the CPU is fully pipelined across the four units shown in Figure 3.11. Concurrency among successive instructions in the Model 91 is illustrated in Figure 3.12. It is desirable to overlap separate instruction functions to the greatest possible degree. The shaded boxes correspond to circuit functions and the thin lines between them refer to delays caused by memory access. Obviously, memory accesses for fetching either instructions or operands take a much longer time than the delays of functional circuitry. Following the delay caused by the initial filling of the pipeline, the execution results will begin emerging at the rate of one per 60 ns.
For the processing of a typical floating-point storage-to-register instruction, we show the functional segmentation of the pipeline in Figure 3.13 along with the clock-time divisions. The basic time cycle accommodates the pipelining of most hardware functions. However, the memory and many execution functions require a variable number of pipeline cycles. In general, these storage and execution functions require a large portion of the time cycles, as revealed in Figure 3.13.
[Figure: timing bars for successive instructions, showing the overlapped instruction access, decode, operand access, execute (E), and result (R) phases.]
Figure 3.12 Concurrency among successive instruction fetch-decode-execute in the IBM 360/91.
After decoding, two parallel sequences of operation may be initiated: one for operand access and the other for the setup of operands to be transmitted to an assigned execution station in the selected arithmetic unit. The effective memory access time must match the speeds of the pipeline stages.
Because of the time disparities between various instruction types, the Model 91 utilizes the organizational techniques of memory interleaving, parallel arithmetic functions, data buffering, and internal forwarding to overcome the speed gap problems. The depth of interleaving is a function of the memory cycle time, the CPU storage request rate, and the desired effective-access time. The Model 91 chooses a depth of 16 for interleaving 400 ns/cycle storage modules to satisfy an effective access time of 60 ns. We will examine pipeline arithmetic and data-buffering techniques in subsequent sections.
Concurrent arithmetic executions are facilitated in the Model 91 by using two separate units for fixed-point execution and floating-point execution. This permits instructions of the two classes to be executed in parallel. As long as no cross-unit data dependencies exist, the execution does not necessarily flow in the sequence in which the instructions are programmed. Within the floating-point E unit are an add unit and a multiply/divide unit which can operate in parallel. Furthermore, pipelining is practiced within arithmetic units, as will be described in Section 3.2.2.
[Figure 3.13: functional segmentation of a typical storage-to-register floating-point instruction in the IBM 360/91, showing the main storage control, instruction unit, and floating-point execution unit functions distributed over the pipeline segments.]
[Figure 3.14: the reservation table for a typical Add or Multiply function evaluated in the instruction pipeline of the IBM 360/91.]
Figure 3.14 shows a reservation table for the instruction pipeline in Figure 3.13. At stage 13, the path to follow depends on the instruction type, one using the floating-point adder and the other using the floating-point multiplier-divider. The adder requires two cycles and the multiplier requires six cycles.
The I unit in the Model 91 is specially designed (Figure 3.15) to support the above pipeline operations. A buffer is used to prefetch up to eight double words of instructions. A special controller is designed to handle instruction-fetch, branch, and interrupt conditions. There are two target buffers for branch handling. Sequential instruction fetch, branch handling, and interrupt handling are all built-in hardware features. After decoding, the I unit will dispatch the instruction to the fixed-point E unit, the floating-point E unit, or back to the storage control unit. For memory reference instructions, the operand address is generated by an address adder. This adder is also used for branch-address generation, if needed. The performance of a pipeline processor relies heavily on the continuous supply of instructions and data to the pipeline.
[Figure 3.15: organization of the I unit — two target buffers and an instruction buffer for branch handling; controls for instruction fetch, branch, and interrupt; the PSW and general registers; instruction decode; an address adder; and operand buffers, with paths to the main storage control unit (MSCU), the fixed-point execution unit (FXEU), and the floating-point execution unit (FLEU). CBRA: conditional branch recovery address; PSW: program status word.]
Figure 3.15 The instruction unit (I unit) in the IBM 360/91 CPU.
When a branch or interrupt occurs, the pipeline will lose many cycles to handle the out-of-sequence operations. Techniques to overcome this difficulty include instruction prefetch, proper buffering, special branch handling, and optimized task scheduling. We will study these techniques in Section 3.3 and check their applications in real-life system designs in Chapter 4.
[Example (garbled in this copy): the partial products ai·bj of two operands are grouped into summand words W1, W2, W3, and three binary numbers A, B, and D are added by a carry-save adder, producing a sum vector S and a carry vector C such that A + B + D = S + C.]
    N(v) = ⌊N(v − 1)/2⌋ × 3 + N(v − 1) mod 2,  with N(1) = 3    (3.9)
For example, one needs a 10-level CSA tree to add 64 to 94 numbers in one pass through the tree. In other words, a pipeline with 10 stages on the CSA tree is needed to multiply two 64-bit fixed-point numbers in one pass. The floor notation ⌊x⌋ refers to the largest integer not greater than x.
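Evaluating the recurrence of Eq. 3.9 numerically (a sketch; the function name is arbitrary) reproduces the 64-to-94 range quoted above:

def max_operands(levels):
    # Largest number of operands a CSA tree with the given number of levels
    # can reduce to a sum vector and a carry vector, per Eq. 3.9.
    n = 3                           # N(1) = 3
    for _ in range(levels - 1):
        n = (n // 2) * 3 + n % 2    # N(v) = floor(N(v-1)/2)*3 + N(v-1) mod 2
    return n

print([max_operands(v) for v in range(1, 11)])
# [3, 4, 6, 9, 13, 19, 28, 42, 63, 94] -- ten levels cover 64 to 94 operands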
Figure 3.17 A pipelined multiplier built with a CSA tree.
Figure 3.18 A pipelined multiplier using an iterative CSA tree for multiple-shift multiplication.
26 clock periods, out of which 24 cycles are needed in the iterative CSA-tree hardware. This iterative approach saves significantly in hardware compared to the single-pass approach. As a contrast, a one-pass 32-input CSA-tree pipeline requires the use of 30 CSAs in eight pipeline stages. The increase in hardware is 26 additional CSAs (each 32 bits wide). The gain in total evaluation time is the saving of
Figure 3.19 The reservation table for the pipelined multiplier in Figure 3.18.
The Add unit allows three pairs of numbers to be loaded into the three reservation stations. The M/D unit has two reservation stations for two pairs of numbers. The pipeline floating-point adder in the Add unit is physically segmented into two stages. Logically, this pipeline adder can be separated into three algorithmic stages, as depicted in Figure 3.21.
[Figure 3.20: the floating-point execution unit of the IBM 360/91 — the floating-point operand stack (FLOS), floating-point registers (FLR), and floating-point buffers (FLB) feed the Add unit (three reservation stations and a two-stage pipelined adder) and the Multiply/Divide unit (two reservation stations, a multiply iteration unit, and a propagate adder); the units are interconnected by the FLR bus, the FLB bus, and the common data bus (CDB).]
[Figure 3.21: the three logical stages of the floating-point add pipeline — characteristic comparison and preshifting (CCP), fraction addition, and zero-digit check (ZDC) with postnormalization and characteristic update, producing the result characteristic and fraction.]
The functions of these three sections are very similar to what we have discussed in Figure 3.2. Exponent arithmetic and fraction addition and subtraction can be done in parallel. The fraction adder is 56 bits wide and the two exponent adders are each 7 bits wide. Both normalized and unnormalized instructions can be executed, as listed in Table 3.1. The two-cycle speed for double-precision (long-format) floating-point addition matches the instruction-issuing rate of the CPU.
Floating-point multiply and divide share the same hardware M/D unit in the Model 91. Multiplier recoding techniques are used to speed up the multiplication process. Six multiplicand multiples are generated after the recoding. The complete pipeline structure of the M/D unit is shown in Figure 3.22. The hardware resources can be separated into two parts: the iterative hardware for multiple multiplicand addition through a CSA tree, as shown within the dashed-line box, and the peripheral hardware for input reservation, prenormalization, multiplier recoding, exponent arithmetic, carry propagation, and output storage. A quadratic convergence division method is applied to generate Q = N/D through dual sequences of multiplication of N and D by a series of converging factors until the denominator converges to unity. The resulting numerator becomes the desired quotient. Therefore, the aforementioned iterative-multiply hardware can do the job without additional facilities.
The convergence division method has been implemented in many models of the IBM 360/370 and in the CDC 6600/7600 systems. The method is briefly described below. We want to compute the ratio (quotient) Q = N/D, where N is the numerator (dividend) and D is the denominator (divisor). Consider normalized binary arithmetic in which 0.5 ≤ N < D < 1 to avoid overflow. Let Ri, for i = 1, 2, ..., be the successive converging factors. One can select

    Ri = 1 + δ^(2^(i−1)),  i = 1, 2, ...,

where δ = 1 − D and 0 < δ ≤ 0.5.
[Figure 3.22: pipeline structure of the M/D unit — peripheral hardware for multiplier recoding and multiplicand-multiple generation feeding the iterative hardware, a CSA merge tree.]
    Q = N/D = (N × R1 × R2 × ··· × Rk) / (D × R1 × R2 × ··· × Rk)
            = [N × (1 + δ)(1 + δ^2)(1 + δ^4) ··· (1 + δ^(2^(k−1)))] / [(1 − δ)(1 + δ)(1 + δ^2) ··· (1 + δ^(2^(k−1)))]    (3.10)

where D = 1 − δ is being substituted. Denote D × R1 × R2 × ··· × Ri = Di and N × R1 × R2 × ··· × Ri = Ni for i = 1, 2, ..., k. We have

    Di = (1 − δ)(1 + δ)(1 + δ^2)(1 + δ^4) ··· (1 + δ^(2^(i−1)))
       = (1 − δ^2)(1 + δ^2)(1 + δ^4) ··· (1 + δ^(2^(i−1)))
       = ···
       = 1 − δ^(2^i)
ultiple
Ml4 Ml 7
[Mal Multipl
rating PS] [ey |
1
Lower
f CSA-C | { Latch i half
To carry-save
To adder
loop
Carry propagate
ADDER
Result latch
DIVIDE LGOP
Figure 3.23 Convergence divide loop using the iterative hardware in the IBM 360/91 floating-point MULTIPLY/DIVIDE unit. (Courtesy of International Business Machines Corp.)
It is clear that 0.5 ≤ D < D1 < D2 < ··· < Dk → 1, because of the fact that 0.5 > δ > δ^2 > δ^4 > ··· > δ^(2^k) > 0. When the number of iterations k is sufficiently large, δ^(2^k) → 0 and thus Dk → 1. We end up with

    Nk = N × (1 + δ)(1 + δ^2)(1 + δ^4) ··· (1 + δ^(2^(k−1)))    (3.11)

which equals the desired quotient Q = N/D = Nk. The smaller the fraction δ, the faster will be the convergence process.
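The convergence scheme of Eqs. 3.10 and 3.11 can be tried out with a short sketch (real-valued, not the Model 91 bit-level hardware; the names and the default iteration count are assumptions):

def convergence_divide(N, D, k=5):
    # Quadratic-convergence division Q = N/D for 0.5 <= N < D < 1: multiply
    # both N and D by R_i = 1 + delta**(2**(i-1)), delta = 1 - D, so that the
    # denominator converges to 1 and the numerator converges to the quotient.
    delta = 1.0 - D
    for i in range(1, k + 1):
        R = 1.0 + delta ** (2 ** (i - 1))
        N *= R
        D *= R              # D_i = 1 - delta**(2**i), approaching 1
    return N                # N_k, approximately N/D

print(convergence_divide(0.6, 0.8))   # approximately 0.75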
The multiply hardware in the M/D unit can be used iteratively to carry out the above convergence division of two 56-bit fractions in the Model 91. Figure 3.23 shows the divide loop when utilizing the iterative hardware in Figure 3.18 for convergence division. A time chart is given in Figure 3.24 to show the two overlapped sequences of multiplications (Eq. 3.10) carried out simultaneously by the upper half and the lower half of the divide loop. Five iterations are needed (k = 5) to converge the numerator into the desired quotient, the factor δ^(2^5) = δ^32 becoming small enough to be considered as zero within the limit of machine precision. In the M/D unit, 12 bits are being shifted per iteration by the multiplier recoding logic. The theory of multiplier recoding using redundant number representation can be found in Hwang's book on Computer Arithmetic (1979).
Concurrency in arithmetic operations has been exploited in the IBM 360/91
in four areas:
1. Concurrent operations of the Add unit and the M/D unit within the floating-
point E unit
wr ag Se ae”
x x XXX, x xX Upper half of divide loop
D219 1% QS 1 ze
wr gg ERE RE ae ae -
ex RO Lower half of divide loop
ie 19% (9 = 18 |z
Figure 3.24 Timing chart showing the overlapped execution in the divide loop shown in Figure 3.23.
The hardware examines multiple instructions and optimizes the program execu-
tion by allowing simultaneous execution of multiple independent instructions.
[Figure 3.25: the arithmetic cell (A cell) and the control cell used as building blocks of the four-function arithmetic pipeline.]
Figure 3.26 A four-function arithmetic pipeline. (Courtesy of IEEE Trans. Computers, Kamal et al. 1974.)
The schematic design of the four-function pipeline with A cells, K cells, and interface latches is shown in Figure 3.26. Arithmetic computations being implemented are specified below in terms of input and output relationships. The A, B, and P lines are for operand inputs. All pairs of B and C lines can be tied together (with the same input value) except for the sqrt operation. K and X are function control signals. The S and Q lines are for outputs. In the following arithmetic I/O equations, all unspecified input lines assume a zero input value (unless otherwise noted). The control signal X = 1 is for the divide and sqrt operations.
Multiply operation

    (A1 A2 A3 A4) + (B1 B2 B3) × (P1 P2) = (S1 S2 S3 S4)    (3.12)

Divide operation

    (A1 A2 A3 A4) ÷ (B1 B2 B3) = (Q1 Q2 Q3) plus (S3 S4 S5)    (3.13)
                                  (quotient)       (remainder)

Squaring operation
Figure 3.27 All possible interstage connections in the TI-ASC arithmetic pipeline (stages: receiver, multiply, accumulate, exponent subtract, align, add, normalize, output). (Courtesy of Stephenson 1973.)
[Figure 3.28: the TI-ASC arithmetic pipeline configured for different functions; each configuration connects a different subset of the receiver, multiply, accumulate, exponent subtract, align, add, normalize, and output stages.]
and L-U decomposition. The pipeline is usually constructed with a cellular array of arithmetic units. The cellular array is usually regularly structured and suitable for VLSI implementation. Presented below is only an introductory sample design of an array pipeline. This array is pipelined in three data-flow directions for the repeated multiplication of pairs of compatible matrices. The basic building blocks in the array are the M cells. Each M cell performs an additive inner-product operation, as illustrated in Figure 3.29.
[Figure 3.29: an M cell with inputs a, b, and c and outputs a' = a, b' = b, and d = a·b + c, together with the cellular array built from such cells, into which the rows of A and the columns of B are fed.]
Figure 3.29 A cellular array for pipelined multiplication of two dense matrices.
Each M cell has the three input operands a, b, and c and the three outputs a' = a, b' = b, and d = a × b + c. Fast latches (registers) are used at all input-output terminals and all interconnecting paths in the array pipeline. All latches are synchronously controlled by the same clock. Adjacency between cells is defined in three orientations: horizontal, vertical, and diagonal (45°) directions. The array shown in Figure 3.29 performs the multiplication of two 3 × 3 dense matrices, A · B = C:

    [a11 a12 a13]   [b11 b12 b13]   [c11 c12 c13]
    [a21 a22 a23] · [b21 b22 b23] = [c21 c22 c23] = C    (3.16)
    [a31 a32 a33]   [b31 b32 b33]   [c31 c32 c33]

The input matrices are fed into the array in the horizontal and vertical directions. Three clock periods are needed for inputting the matrix entries: one row at a time for the A matrix and one column at a time for the B matrix. Dummy zero inputs are marked at unused input lines. "Don't care" conditions at the output lines are left blank. In general, to multiply two n × n matrices requires 3n^2 − 4n + 2 M cells. It takes 3n − 1 clock periods to complete the multiply process.
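A scalar model of the M cell, and a straightforward (non-pipelined) way of using it to form C = A·B, is sketched below; the real array overlaps these inner-product steps across cells, which this sketch does not attempt to show:

def m_cell(a, b, c):
    # The additive inner-product step of one M cell: pass a and b through
    # and produce d = a*b + c.
    return a, b, a * b + c

def matmul_with_m_cells(A, B):
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                _, _, acc = m_cell(A[i][k], B[k][j], acc)
            C[i][j] = acc
    return C

print(matmul_with_m_cells([[1, 2], [3, 4]], [[5, 6], [7, 8]]))   # [[19, 22], [43, 50]]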
When the matrix size becomes too large, the global array approach will pose a serious problem for monolithic chip implementation because of density and I/O packaging constraints. For a practical design of array pipelines, a block-partitioning approach will be introduced in Chapter 10 for VLSI matrix arithmetic. VLSI array-pipeline structures will be treated there with their potential real-time applications.
Key design problems of pipeline processors are studied in this section. We begin with a review of various instruction-prefetch and branch-control strategies for designing pipelined instruction units. Data-buffering and busing structures are presented for smoothing pipelined operations to avoid congestion. We will study internal data-forwarding and register-tagging techniques by examining instruction-dependence relations. The detection and resolution of logic hazards in pipelines will be described. Principles of job sequencing in a pipeline will be studied with reservation tables to avoid collisions in utilizing pipeline resources. Finally, we will consider the problems of designing dynamic pipelines and the necessary system supports for pipeline reconfigurations.
[Table: typical pipeline cycle allocations for different instruction types (column headings are not recoverable in this copy)]
Instruction fetch: 6 cycles for every instruction type
Decode: 2 cycles
Condition test: 1 cycle (conditional branches)
Operand address calculation: 2 cycles
Operand fetch(es): 6-12 cycles
Arithmetic logic execution: 4-8 cycles
Store result: 6 cycles
Update PC and flags: 1 cycle
Total pipeline cycles: 21-31 for the arithmetic-load type; roughly 11 to 17 for the store and branch types
operations which require one or two operand fetches. The execution of different arithmetic operations requires a different number of pipeline cycles. The store-type operation does not require an operand fetch, but a memory access is needed to store the data. The branch-type operation corresponds to an unconditional jump. There are two possible paths for a conditional branch operation. The yes path requires the calculation of the new address being branched to, whereas the no path proceeds to the next sequential instruction in the program. The arithmetic-load and store instructions do not alter the sequential execution order of the program. The branch instructions (25 percent in typical programs) may alter the program counter (PC) in order to jump to a program location other than the next instruction. Different types of instructions require different cycle allocations. The branch types of instructions will cause some damaging effects on the pipeline performance.
Some functions, like interrupt and branch, produce damaging effects on the performance of pipeline computers. When instruction I is being executed, the occurrence of an interrupt postpones the execution of instruction I + 1 until the interrupting request has been serviced. Generally, there are two types of interrupts. Precise interrupts are caused by illegal operation codes found in instructions, which can be detected during the decoding stage. The other type, imprecise interrupts, is caused by defaults from storage, address, and execution functions.
Since decoding is usually the first stage of an instruction pipeline, an interrupt on instruction I prohibits instruction I + 1 from entering the pipeline. However, those instructions preceding instruction I that have not yet emerged from the pipeline continue to run until the pipeline is drained. Then the interrupt routine is serviced. An imprecise interrupt usually occurs when the instruction is halfway through the pipeline and subsequent instructions are already admitted into the pipeline. When an interrupt of this kind occurs, no new instructions are allowed to
enter the pipeline, but all the incomplete instructions inside the pipeline, whether they precede or follow the interrupted instruction, will be completed before the processing unit is switched to service the interrupt.
In the Star-100 system, the pipelines are dedicated to vector-oriented arithmetic operations. In order to handle interrupts during the execution of a vector instruction, special interrupt buffer areas are needed to hold addresses, delimiters, field lengths, etc., that are needed to restart the vector instructions after an interrupt. This demands a capable recovery mechanism for handling unpredictable and imprecise interrupts.
For the Cray-1 computer, the interrupt system is built around an exchange
package. To change tasks, it is necessary to save the current processor state and
to load a new processor state. The Cray-1 does this semiautomatically when an
interrupt occurs or when a program encounters an exit instruction. Under such
circumstances, the Cray-1 saves the eight scalar registers, the eight address registers,
the program counter, and the monitor flags. These are packed into 16 words and
swapped with a block whose address is specified by a hardware exchange address
register. However, the exchange package does not contain all the hardware state
information, so software interrupt handlers must save the rest of the states. “The
rest” includes 512 words of vector registers, 128 words of intermediate registers,
a vector mask, and a real-time clock.
The effect of branching on pipeline performance is described below for a linear instruction pipeline consisting of five segments: instruction fetch, decode, operand fetch, execute, and store results. Possible memory conflicts between overlapped fetches are ignored, and a sufficiently large cache memory (instruction-data buffers) is used in the following analysis.
As illustrated in Figure 3.30, the instruction pipeline executes a stream of instructions continuously in an overlapped fashion if branch-type instructions do not appear. Under such circumstances, once the pipeline is filled up with sequential instructions (nonbranch type), the pipeline completes the execution of one instruction per fixed latency (usually one or two clock periods).
On the other hand, a branch instruction entering the pipeline may be halfway down the pipe (such as a "successful" conditional branch instruction) before a branch decision is made. This will cause the program counter to be loaded with the new address to which the program should be directed, making all prefetched instructions (either in the cache memory or already in the pipeline) useless. The next instruction cannot be initiated until the completion of the current branch-instruction cycle. This causes extra time delays in order to drain the pipeline, as depicted in Figure 3.30c. The overlapped action is suspended and the pipeline must be drained at the end of the branch cycle. The continuous flow of instructions into the pipeline is thus temporarily interrupted because of the presence of a branch instruction.
In general, the higher the percentage of branch-type instructions in a program, the slower a program will run on a pipeline processor. This certainly does not merit the concept of pipelining. An analytical estimation of the effect of branching on an n-segment instruction pipeline is given below.
[Figure 3.30: (a) a five-segment instruction pipeline (fetch instruction, decode, fetch operands, execute, store results); (b) and (c) space-time diagrams of overlapped instruction execution without and with branching.]
The instruction cycle is assumed to include n pipeline cycles. For example, one instruction cycle is equal to five pipeline clock periods in Figure 3.30. Clearly, if a branch instruction does not occur, the performance would be one instruction per each pipeline cycle. Let p be the probability of a conditional branch instruction in a typical program (20 percent by Table 3.2) and q be the probability that a branch is successful (60 percent by Table 3.2). Suppose that there are m instructions waiting to be executed
through the pipeline. The number of instructions that cause successful branches equals m·p·q. Since (n − 1)/n extra time delay is needed for each successful branch instruction, the total instruction cycles required to process these m instructions equal (1/n)[n + (m − 1)] + m·p·q·(n − 1)/n. As m becomes very large, the performance of the instruction pipeline is measured by the average number of instructions executed per instruction cycle:

    lim (m → ∞)  m / {(1/n)[n + (m − 1)] + m·p·q·(n − 1)/n}  =  n / [1 + p·q·(n − 1)]    (3.17)
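Eq. 3.17 is easy to evaluate for representative values (a sketch; the numbers below simply reuse the 20 percent and 60 percent figures quoted above):

def instructions_per_cycle(n, p, q):
    # Average number of instructions completed per instruction cycle in an
    # n-segment pipeline when a fraction p of instructions are conditional
    # branches and a fraction q of those are taken (Eq. 3.17).
    return n / (1 + p * q * (n - 1))

print(instructions_per_cycle(5, 0.20, 0.60))   # about 3.38 instead of the ideal 5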
[Figure: an instruction-prefetch organization — the memory system (access time T) feeds a sequential prefetch buffer (s words) and a target prefetch buffer (t words); instructions pass through a decoder (r time units) into the execution pipeline.]
other discarded. This branch-target prefetch approach may increase the utilization
of the pipeline CPU and thus increase the total system throughput.
[Figure: removing a pipeline bottleneck — (a) segment S2 is the bottleneck (τ1 = τ3 = T, τ2 = 3T); the bottleneck can be removed either by subdividing S2 into shorter stages of delay T or by replicating S2 in parallel, each copy with delay 3T.]
Data and instruction buffers Another method to smooth the traffic flow in a pipeline
is to use buffers to close up the speed gap between the memory accesses for either
instructions or operands and the arithmetic logic executions in the functional pipes.
The instruction or operand buffers provide a continuous supply of instructions or
operands to the appropriate pipeline units. Buffering can avoid unnecessary idling
of the processing stages caused by memory-access conflicts or by unexpected
branching or interrupts. Sometimes the entire loop’s instructions can be stored in
the buffer to avoid repeated fetch of the same instruction loop, if the buffer size is
sufficiently large. The amount of buffering is usually very large in pipeline
computers.
The use of instruction buffers and various data buffers in the IBM System/
360 Model 91 is shown in Figure 3.33. Three buffer types are used for various in-
struction and data types. Instructions are first fetched to the instruction-fetch
buffers (64 bits each) before sending them to the instruction unit (Figure 3.15).
After decoding, fixed-point and floating-point instructions and data are sent to
their dedicated buffers, as labeled in Figure 3.33. The store-address and data
buffers are used for continuously storing results back to the main memory. We
have already explained the function of target buffers for instruction prefetches.
The storage-conflict buffers are used only when memory-access conflicts are taking
place.
In the STAR-100 system, a 64-word (of 128 bits each) buffer is used to tempor-
arily hold the input data stream until operands are properly aligned. In addition,
there is an instruction buffer which provides for the storage of thirty-two 64-bit
instructions. Eight 64-bit words in the instruction buffer will be filled up by one
memory fetch. The buffer supplies a continuous stream of instructions to be
executed, despite memory-access conflicts.
In the TI-ASC system, two eight-word buffers are utilized to balance the stream
of instructions from the memory to the execution unit. A memory buffer unit has
three double buffers, X, Y, and Z. Two buffers (X and Y) are used to hold the input
operands and the third (Z buffer) is used for the output results. These buffers
greatly alleviate the problem of mismatched bandwidths between the memory
and the arithmetic pipelines.
In the Floating-Point Systems AP-120B, there are two blocks of registers
serving as operand buffers for the pipeline multiplier and adder. In the Cray-1
system, eight 64-bit scalar registers and sixty-four 64-bit data buffers are used for
scalar operands. Eight 64-word vector registers are used as operand buffers for
vector operations. There are also four instruction buffers in the Cray-1, each con-
sisting of sixty-four 16-bit registers. With four instruction buffers, substantial
program segments can be prefetched to allow on-line arithmetic logic operations
through the functional pipes.
Figure 3.33 Data buffers, transfer paths, reservation stations, and the common data bus (CDB) in the IBM System/360 Model 91 floating-point execution unit. (Courtesy of International Business Machines Corp.)
Busing structures Ideally, the subfunction being executed by one stage should be
independent of the other subfunctions being executed by the remaining stages;
otherwise, some processes in the pipeline must be halted until the dependency is
removed. For example, when one instruction waiting to be executed is first to be
modified by a future instruction, the execution of this instruction must be sus-
pended until the dependency is released. Another example is the conflicting use of
some registers or memory locations by different segments of a pipeline. These
problems cause additional time delays. An efficient internal busing structure is
desired to route results to the requesting stations with minimum time delays.
In the TI-ASC system, once instruction dependency is recognized, only
independent instructions are distributed over the arithmetic units. Update
capability is incorporated into the processor by transferring the contents of the
Z buffer to the X buffer or the Y buffer. With such a busing structure, time delays
due to dependency are significantly reduced. In the STAR-100 system, direct routes
are established from the output transmit segment to the input receive segment.
Thus, no registers are required to store the intermediate results, which causes a
significant saving of data-forwarding delays.
In the AP-120B or FPS-164 attached processors, the busing structures are
even more sophisticated. Seven data buses provide multiple data paths. The
output of the floating-point adder in the AP-120B can be directly routed back to
the input of the floating-point adder, to the input of the floating-point multiplier,
to the data pad, or to the data memory. Similar busing is provided for the output
of the floating-point multiplier. This eliminates the time delay to store and to
retrieve the intermediate results to or from the registers.
In the Cray-1 system, multiple data paths are also used to interconnect various
functional units and the register and memory files. Although efficient busing struc-
tures can reduce the damaging effects of instruction interdependencies, a great
burden is still exerted on the compiler to produce codes exposing parallelism. If
independent and dependent instructions are intermixed appropriately, more
concurrent processing can take place in a multiple-pipe computer.
being replaced by

M1 ← (R1)   (store)
R2 ← (R1)   (register transfer)     (only one memory access)
being replaced by

R1 ← (M1)   (fetch)
R2 ← (R1)   (register transfer)     (one memory access)
Store-store overwriting The following two memory updates (stores) of the same
word (Figure 3.34c) can be combined into one, since the second store overwrites
the first:

M1 ← (R1)   (store)
M1 ← (R2)   (store)     (two memory accesses)

being replaced by

M1 ← (R2)   (store)     (one memory access)
The following example shows how to apply internal forwarding to simplify a
sequence of arithmetic and memory-access operations. Figure 3.35 depicts these
simplification steps, in which adjacent steps are combined to minimize memory
references. Nodes in the graph correspond to the memory cells, registers, an
adder, or a multiplier.
Example 3.1 The inner loop of a certain program is compiled to perform the
following operations in sequence:

1. R1 ← (M1)               (fetch)
2. R1 ← (R1) + (M2)        (add)
3. R1 ← (R1) * (M3)        (multiply)
4. M4 ← (R1)               (store)

After the internal forwarding, we end up handling a compound function
(macroinstruction) M4 ← [(M1) + (M2)] * (M3), as represented by the
simplified data-flow graph in Figure 3.35d.
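The three forwarding rules can be applied mechanically. The sketch below is a minimal illustration (Python; the operation encoding and the function name are hypothetical, not those of any real machine) that rewrites adjacent store/fetch pairs according to the store-fetch, fetch-fetch, and store-store rules above.

# Each operation is (kind, destination, source), with kind in {"store", "fetch"}:
#   ("store", "M1", "R1")  means  M1 <- (R1)
#   ("fetch", "R2", "M1")  means  R2 <- (M1)
def forward(ops):
    out = []
    for op in ops:
        kind, dst, src = op
        if out:
            pkind, pdst, psrc = out[-1]
            # store-fetch: M <- (R1); R2 <- (M)  ==>  M <- (R1); R2 <- (R1)
            if pkind == "store" and kind == "fetch" and src == pdst:
                out.append(("move", dst, psrc))     # register transfer, no memory access
                continue
            # fetch-fetch: R1 <- (M); R2 <- (M)   ==>  R1 <- (M); R2 <- (R1)
            if pkind == "fetch" and kind == "fetch" and src == psrc:
                out.append(("move", dst, pdst))
                continue
            # store-store: M <- (R1); M <- (R2)   ==>  M <- (R2)
            if pkind == "store" and kind == "store" and dst == pdst:
                out[-1] = op
                continue
        out.append(op)
    return out

seq = [("store", "M1", "R1"), ("fetch", "R2", "M1")]
print(forward(seq))    # [('store', 'M1', 'R1'), ('move', 'R2', 'R1')]: one memory access saved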
Both internal forwarding and resource tagging have been practiced in the
IBM Model 91 floating-point execution unit. The data registers, transfer paths,
floating-point adder and multiply-divide units, reservation stations, and the
common data bus (CDB) in the Model 91 were shown in Figure 3.33. The three
reservation stations for the adder are denoted as A1, A2, and A3. The two reservation
stations in the multiply-divide unit are M1 and M2. Each station has source
and sink registers with their tag and control fields. The stations can hold operands
for the next execution while the functional unit is busy executing the current instruction.
Figure 3.35 Internal data forwarding in Example 3.1 (memory accesses: thick arrows; register transfers: thin arrows); panel (d) shows steps 3 and 4 forwarded.
Three store data buffers (SDB) and four floating-point registers (FLR) are all
tagged. The busy bits in the FLRs marking their status (1 for busy and 0 for idle)
can be used to determine the dependence of instructions in subsequent executions.
The CDB is used to transfer operands to the FLRs, the reservation stations,
and the SDB. There are 11 units that can supply information to the CDB, including
six floating-point buffers (FLB), three adder stations, and two multiply-divide
stations. The tag fields of these units are binary-coded as FLBs 1 through 6, add stations
10 through 12, and multiply-divide stations 8 and 9. A tag is generated by the CDB
priority controls to identify the unit whose result will next appear on the CDB.
This common data busing and register-tagging scheme permits simultaneous
execution of independent instructions while preserving the essential precedences
inherent in the instruction stream. The CDB can function with any number of
execution units and any number of accumulators. It provides a hardware algorithm
for the automatic efficient exploitation of multiple arithmetic units. The following
example shows how internal forwarding can be achieved with the tagging scheme
on the CDB.
Example 3.2 Consider the consecutive execution of two floating-point in-
structions in the Model 91 (Figure 3.33), where F refers to an FLR which is
being used as an accumulator and B_i stands for the ith FLB. Their contents are
represented by (F) and (B_i), respectively:

ADD  F, B1      F ← (F) + (B1)
MPY  F, B2      F ← (F) * (B2)

In processing the add instruction, set the busy bit of F to 1, send the
contents (F) and (B1) to the adder station A1, set the tag field of F to 1010
(the tag value of station A1), and then carry out the addition.
In the meantime, the decode of the mpy (multiply) instruction reveals
the fact that F is busy. This implies that the mpy depends on the result of the
add. However, the execution should not be halted. Instead, the tag of F should
be sent to the multiply station M1 to set the tag of M1 to 1010 as well. Then
the tag of F should be changed to 1000 (the tag value of station M1) and
the content (B2) sent to M1. When the add instruction is completed, the
CDB finds that the addition result should be sent directly to M1 (instead of F).
The multiply-divide unit begins its execution when both operands become
available. After the mpy operation is done, the CDB finds F via the tag 1000
of M1, and thus sends the multiply result to F. In this process, the intermediate
result (after addition) is not sent to F before it reaches M1. This is
exactly a consequence of internal forwarding, using the tag as a vehicle to
identify source and destination in successive computations.
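The following sketch is a much-simplified model of the tagging sequence just described (Python; the class, dictionary, and helper names are illustrative assumptions and omit the real Model 91 control details). It follows the ADD/MPY pair of Example 3.2: the addition result is forwarded to the multiply station via tag 1010, and the multiply result reaches F via tag 1000.

# Tags follow the text: add station A1 = 1010, multiply station M1 = 1000.
class FLR:                        # one floating-point register used as an accumulator
    def __init__(self, value):
        self.value, self.busy, self.tag = value, 0, None

def issue_add(F, b1, A1):         # ADD F,B1 : F <- (F) + (B1)
    A1.update(op="add", src=F.value, sink=b1)
    F.busy, F.tag = 1, "1010"     # the result will come from station A1

def issue_mpy(F, b2, M1):         # MPY F,B2 : F <- (F) * (B2)
    M1.update(op="mpy", src_tag=F.tag, sink=b2)   # F is busy: forward F's tag (A1's) to M1
    F.tag = "1000"                # F now waits for M1's result instead

def broadcast(tag, value, F, M1):
    # common data bus: every waiting unit compares its tag with the broadcast tag
    if M1.get("src_tag") == tag:
        M1["src"] = value         # the addition result goes straight to M1, not to F
    if F.tag == tag:
        F.value, F.busy, F.tag = value, 0, None

F, A1, M1 = FLR(2.0), {}, {}
issue_add(F, 3.0, A1)             # F <- 2.0 + 3.0, carried out in A1
issue_mpy(F, 4.0, M1)             # F <- (F) * 4.0, waits for A1's result via tag 1010
broadcast("1010", A1["src"] + A1["sink"], F, M1)   # A1 finishes; result forwarded to M1
broadcast("1000", M1["src"] * M1["sink"], F, M1)   # M1 finishes; result lands in F
print(F.value)                    # 20.0 = (2.0 + 3.0) * 4.0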
methods and approaches to resolve hazards are then introduced. Hazards dis-
cussed in this section are known as data-dependent hazards. Methods to cope with
such hazards are needed in any type of lookahead processors for either synchron-
ous-pipeline or asynchronous-multiprocessing systems. Another type of hazard
is due to a job scheduling problem and will be described in Section 3.3.5.
When successive instructions overlap their fetch, decode and execution
through a pipeline processor, interinstruction dependencies may arise to prevent
the sequential data flow in the pipeline. For example, an instruction may depend
on the results of a previous instruction. Until the completion of the previous
instruction, the present instruction cannot be initiated into the pipeline. In other
instances, two stages of a pipeline may need to update the same memory location.
Hazards of this sort, if not properly detected and resolved, could result in an inter-
lock situation in the pipeline or produce unreliable results by overwriting.
There are three classes of data-dependent hazards, according to various data
update patterns: write after read (WAR) hazards, read after write (RAW) hazards,
and write after write (WAW) hazards. Note that read-after-read does not pose a
problem, because nothing is changed.
We use resource objects to refer to working registers, memory locations, and
special flags. The contents of these resource objects are called data objects. Each
instruction can be considered a mapping from a set of data objects to a set of data
objects. The domain D(I) of an instruction I is the set of resource objects whose
data objects may affect the execution of instruction I. The range R(I) of an instruc-
tion I is the set of resource objects whose data objects may be modified by the execu-
tion of instruction I. Obviously, the operands to be used in an instruction execution
are retrieved (read) from its domain, and the results will be stored (written) in its
range. In what follows, we consider the execution of two instructions I and J in
a program. Instruction J appears after instruction I in the program. There may
be none or several other instructions between I and J. The latency between
the two instructions is a very subtle matter. Instruction J may enter the execution
pipe before or after the completion of the execution of instruction I. Improper
timing and data dependencies may create some hazardous situations, as shown
in Figure 3.36.
A RAW hazard between the two instructions I and J may occur when J
attempts to read some data object that has been modified by I. A WAR hazard
may occur when J attempts to modify some data object that is read by I. A WAW
hazard may occur if both I and J attempt to modify the same data object. Formally,
the necessary conditions for these hazards are stated as follows (Figure 3.36):

R(I) ∩ D(J) ≠ ∅   for RAW
R(I) ∩ R(J) ≠ ∅   for WAW          (3.18)
D(I) ∩ R(J) ≠ ∅   for WAR
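The conditions of Eq. 3.18 translate directly into set intersections. The sketch below (Python; the example domain and range sets are assumptions chosen only for illustration) reports which hazards may exist between a pair of instructions I and J.

# A direct transcription of the hazard conditions in Eq. 3.18 using Python sets.
def hazards(D_I, R_I, D_J, R_J):
    # instruction I precedes instruction J; D = domain, R = range
    found = []
    if R_I & D_J:
        found.append("RAW")      # J reads something I writes
    if R_I & R_J:
        found.append("WAW")      # both write the same object
    if D_I & R_J:
        found.append("WAR")      # J writes something I reads
    return found

# I: R1 <- (R2) + (R3)      J: R4 <- (R1) * (R5)
print(hazards(D_I={"R2", "R3"}, R_I={"R1"},
              D_J={"R1", "R5"}, R_J={"R4"}))     # ['RAW']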
Possible hazards for the four types of instructions (Table 3.1) are listed in
Table 3.3. Recognizing the existence of possible hazards, computer designers wish
to detect the hazard and then to resolve it effectively. Hazard detection can be done
Figure 3.36 Illustration of RAW, WAW, and WAR hazard conditions.
Once a hazard is detected, the system should resolve the interlock situation.
Consider the instruction sequence {..., I, I + 1, ..., J, J + 1, ...} in which a haz-
ard has been detected between the current instruction J and a previous instruction
I. A straightforward approach is to stop the pipe and to suspend the execution of
instructions J, J + 1, J + 2, ..., until instruction I has passed the point of
resource conflict. A more sophisticated approach is to suspend only instruction
J and continue the flow of instructions J + 1, J + 2, ..., down the pipe. Of course,
the potential hazards due to the suspension of J should be continuously checked
as instructions J + 1, J + 2, ... move ahead of J. Multilevel hazard detection may
be encountered, requiring much more complex control mechanisms to resolve a
stack of hazards.
In order to avoid RAW hazards, IBM engineers developed a short-circuiting
approach which gives a copy of the data object to be written directly to the in-
struction waiting to read the data. This concept was generalized into a technique,
known as data forwarding, which forwards multiple copies of the data to as many
waiting instructions as may wish to read it. A data-forwarding chain can be estab-
lished in some cases. The internal-forwarding and register-tagging techniques
presented in the previous section should be helpful in resolving logic hazards in
pipelines.
Figure 3.37 Reservation table and state diagram for a unifunction pipeline.
pairs of x’s of each row of the reservation table is called the forbidden set of
latencies. The forbidden set contains all possible latencies that cause collisions
between two initiations. The collision vector is a binary vector, shown below:
C= (C,+°°C2€)) where C; = | ifie
F and C; = Oif otherwise (3.18)
For the example in Figure 3.37, the forbidden list F = {1, 5, 6, 8} and the
collision vector C = (10110001), where n = 8 is the largest forbidden latency
obtained from the reservation table. This means C_n = 1 is always true. The collision
vector shows both permitted and forbidden latencies from the same reservation
table. One can use an n-bit shift register to hold the collision vector for implement-
ing a control strategy for successive task initiations in the pipeline. Upon initiation
of the first task, the collision vector is parallel-loaded into the shift register as the
initial state. The shift register is then shifted right one bit at a time, entering 0s
from the left end. A collision-free initiation is allowed at time instant t + k if,
and only if, a bit 0 is being shifted out of the register after k shifts from time t.
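This shift-register strategy can be sketched in a few lines. In the Python fragment below the reservation table is a hypothetical one chosen only for illustration (it is not the table of Figure 3.37); the code derives the forbidden latencies and the collision vector from it and then simulates the shift register for a proposed latency sequence.

# Rows are stages; the entries are the time steps at which each stage is used by one task.
def forbidden_latencies(table):
    F = set()
    for times in table.values():
        F |= {abs(a - b) for a in times for b in times if a != b}
    return F

def collision_vector(F):
    n = max(F)
    return [1 if i in F else 0 for i in range(1, n + 1)]    # element i-1 holds bit Ci

def collides(F, latencies):
    # simulate the shift register for a sequence of initiation latencies
    C = collision_vector(F)
    state = C[:]                                    # loaded on the first initiation
    for k in latencies:
        if k <= len(state) and state[k - 1] == 1:   # a '1' shifted out after k shifts
            return True                             # means a collision
        shifted = state[k:] + [0] * min(k, len(state))      # shift right k positions
        state = [a | b for a, b in zip(shifted, C)]         # OR with the collision vector
    return False

table = {"S1": [0, 6], "S2": [1], "S3": [2, 4], "S4": [3], "S5": [5]}   # assumed table
F = forbidden_latencies(table)
print(sorted(F), collision_vector(F))     # forbidden set and collision vector (C1 ... Cn)
print(collides(F, [3, 4, 3, 4]))          # False: this latency sequence is collision-free here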
A state diagram is used to characterize the successive initiations of tasks in the
pipeline in order to find the shortest latency sequence to optimize the control
strategy. A state in the diagram is represented by the contents of the shift register
after the proper number of shifts is made, which is equal to the latency between the
current and the next task initiations.
As shown in Figure 3.37b, the initial state corresponds to the collision vector
(10110001). There are four outgoing branches from the initial state, labeled by
latencies 2, 3, 4, and 7, corresponding, respectively, to the zero bit positions C2, C3,
C4, and C7 in the vector (10110001). By shifting the vector (10110001) right two
positions, we obtain the vector (00101100). This vector is then bitwise ORed with
the collision vector (10110001) to produce a new collision vector (10111101) as the
new state pointed to by the arc labeled 2. Similarly, one obtains the new state
vectors (10110111) and (10111011) after shifting by latencies 3 and 4, respectively.
The arc 7 branches back to the initial state. This shifting process should continue
until no more new states can be generated. The shift register is reset to the initial
state if the latency (shift) is greater than or equal to n.
The successive collision vectors are used to prevent future task collisions with
previously initiated tasks, while the collision vector C is used to prevent possible
collisions with the current task. If a collision vector has a “1” in the ith bit (from
the right) at time t, then the task sequence should avoid the initiation of a task at
time t + i. The bitwise ORing operations will avoid collisions in any workable
latency sequence that can be traced on the state diagram. Closed loops or cycles
in the state diagram indicate the steady-state sustainable latency sequences of
task initiations without collisions. The average latency of a cycle is the sum of its
latencies (period) divided by the number of states in the cycle. Any cycle can be
entered from the initial state.
The cycle consisting of states (10110111) and (10111011) in Figure 3.37b has
two latencies, three and four. This cycle has a period equal to 7 = 3 + 4. The
average latency of this cycle is 7/2 = 3.5. Another cycle, which consists of the states
(10110001), (10111101), and (10111111), has the three latencies 2, 2, and 7, with a
period of 11. Its average latency equals 11/3 = 3.67. The throughput of a
pipeline is proportional to the reciprocal of the average latency. A latency
sequence is called permissible if no collisions exist in the successive initiations
governed by the given latency sequence. The maximum throughput is achieved
by an optimal scheduling strategy that achieves the minimum average latency
(MAL) without collisions. Thus, the job-sequencing problem is equivalent to
finding a permissible latency cycle with the MAL in the state diagram. The maxi-
mum number of x's in any single row of the reservation table is a lower bound
of the MAL. In other words, the MAL is always greater than or equal to the maxi-
mum number of check marks in any row of the reservation table.
Table 3.4 Simple cycles and their average latencies

Simple cycle        Average latency
(7)                 7
(3, 7)              5
(3, 4)†             3.5
(4, 3, 7)           4.67
(4, 7)              5.5
(2, 7)              4.5
(2, 2, 7)†          3.67
(3, 4, 7)           4.67

† Greedy cycles.
Simple cycles are those latency cycles in which each state appears only once
per iteration of the cycle. Listed in Table 3.4 are the simple cycles and their
average latencies for the state diagram shown in Figure 3.37b. A simple cycle is a
greedy cycle if each latency contained in the cycle is the minimal latency (outgoing
arc) from a state in the cycle. For Figure 3.37b, the cycles (3, 4) and (2, 2, 7) are both
greedy, with average latencies of 3.5 and 3.67, respectively. A good task-initiation
sequence should include a greedy cycle.
The procedure to determine the greedy cycles on the state diagram is rather
straightforward. From each node of the state diagram, one simply chooses the
arc with the smallest latency label until a closed simple cycle can be formed. The
average latency of any greedy cycle is no greater than |F| + 1, where |F| is the
cardinality of the set F, which equals the number of 1s in the initial collision vector.
The average latency of any greedy cycle is always lower-bounded by the MAL. In
the above example, the greedy cycle (3, 4) has an average latency equal to the
MAL = 3.5, which is smaller than 4, the number of 1s in the initial collision vector.
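For small collision vectors the entire state diagram can be enumerated directly. The sketch below (Python) starts from the initial collision vector (10110001) of the example above, generates the reachable states, lists the simple cycles, and reports the MAL; it is a brute-force illustration rather than an efficient algorithm.

def build_states(C, n):
    states, edges, work = {C}, {}, [C]
    while work:
        s = work.pop()
        edges[s] = {}
        for k in range(1, n + 1):
            if not (s >> (k - 1)) & 1:              # bit k is 0: latency k is permitted
                t = (s >> k) | C                    # shift right k positions, then OR with C
                edges[s][k] = t
                if t not in states:
                    states.add(t)
                    work.append(t)
        edges[s][n + 1] = C                         # any latency >= n + 1 returns to the start
    return edges

def simple_cycles(edges):
    cycles = []
    def walk(path, lats):
        for k, t in edges[path[-1]].items():
            if t == path[0]:
                cycles.append(tuple(lats + [k]))
            elif t not in path:
                walk(path + [t], lats + [k])
    for s in edges:
        walk([s], [])
    return cycles

C, n = 0b10110001, 8                                # collision vector for F = {1, 5, 6, 8}
edges = build_states(C, n)
best = min(simple_cycles(edges), key=lambda c: sum(c) / len(c))
print(len(edges), "states; MAL =", sum(best) / len(best), "from cycle", best)   # MAL = 3.5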
The job-sequencing method for static unifunction pipelines can be generalized
for designing multifunction pipelines. A pipeline processor which can perform p
distinct functions can be described by p reservation tables overlaid together. In
order to perform multiple functions, the pipeline must be reconfigurable. One
example of a static multifunction pipeline is the arithmetic pipeline in the TI-ASC,
which has eight stages with about 20 possible functional configurations. Each
task to be initiated can be associated with a function tag identifying the reservation
table to be used. Collisions may occur between two or more tasks with the same
function tag or with distinct function tags.
The stage-usage pattern for each function can be displayed with a different
tag in the overlaid reservation table. For a p-function pipeline, an overlaid reserva-
tion table is formed by overlaying p unifunctional reservation tables. An overlaid
reservation table for a two-function pipeline is shown in Figure 3.38a, where
A and B stand for two distinct functions. Each task-requesting initiation must
be associated with a function tag. A forbidden set of latencies for a multifunction
pipeline is the collection of collision-causing latencies.
t       0     1     2     3     4
1       A     B           A     B
2             A           B
3       B           A,B         A

Cross-collision vectors (C4 C3 C2 C1):
V_AA = (0 1 1 0)     V_AB = (1 0 1 1)
V_BA = (1 0 1 0)     V_BB = (0 1 1 0)

Collision matrices:
M_A = [ V_AA ; V_BA ] = [ 0 1 1 0 ; 1 0 1 0 ]     M_B = [ V_AB ; V_BB ] = [ 1 0 1 1 ; 0 1 1 0 ]
A task with function tag A may collide with a previously initiated task with function tag B if the latency
between these two initiations is a member of the forbidden list.
A cross-collision vector V_AB marks the forbidden latencies between the function
pair A and B. The binary vector V_AB may be calculated by overlaying the reservation
tables for A and B. A component C_k = 1 if some row of the overlaid reservation
table contains an A in column t (for some t) and a B in column t + k; the component
C_k equals 0 otherwise. Thus, Figure 3.38 has four cross-collision vectors: V_AA =
(0 1 1 0), V_AB = (1 0 1 1), V_BA = (1 0 1 0), and V_BB = (0 1 1 0). In general,
there are p² cross-collision vectors for a p-function pipeline. The p² cross-collision
vectors can be rewritten as p collision matrices, as shown in Figure 3.38b. The
collision matrix M_R indicates forbidden latencies for all functions initiated before
the initiation of a task with the function tag R. The ith row in matrix M_R is the
cross-collision vector V_iR, where i = 1, 2, ..., p.
A p-function pipeline can be controlled by a bank of p shift registers. Shift
register Q controls the initiation of function Q. The control bits for function initia-
tions are the rightmost bits of the shift registers. Initiation of a task with function
tag Q is allowed at the next time instant if the rightmost bit of the corresponding
shift register Q is 0. The shift registers shift right one position per cycle, with
0s entering from the left. Immediately after the initiation of a task with function
tag Q, the collision matrix M_Q is ORed with the matrix formed by the bank of shift
registers; that is, the state of shift register R is bitwise ORed with the cross-collision
vector V_RQ, for all 1 ≤ R ≤ p.
A state diagram is constructed in Figure 3.38c for the two-function pipeline.
Arcs are labeled with the latency and the function tag of the initiation. The initial
state can be any one of the p collision matrices. Cycles in the state diagram correspond
to collision-free patterns of task initiations. Any cycle can be entered from at least
one of the initial states. For example, the cycle (A3, B1) in Figure 3.38c can be
reached by an arc labeled A3 from the initial state M_A, or by an arc labeled B1 from the initial
state M_B. The method of finding the greedy cycles and the MAL on the state diagram
of a multifunction pipeline can be extended from that for a unifunction pipeline.
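The definition of a cross-collision vector can be checked mechanically. The sketch below (Python) derives the four vectors V_AA, V_AB, V_BA, and V_BB from the stage-usage sets of the overlaid reservation table shown above; the encoding of that table as Python sets is of course an illustrative choice.

def cross_collision(table, A, B, max_latency):
    # V_AB: bit k is set if some row has an A in column t and a B in column t + k
    V = []
    for k in range(1, max_latency + 1):
        hit = any(t + k in row[B] for row in table for t in row[A])
        V.append(1 if hit else 0)
    return V            # listed as (C1, C2, ..., Cn); Figure 3.38 prints (Cn ... C1)

# columns in which each function uses each row of the overlaid table of Figure 3.38a
table = [
    {"A": {0, 3}, "B": {1, 4}},     # row 1
    {"A": {1},    "B": {3}},        # row 2
    {"A": {2, 4}, "B": {0, 2}},     # row 3
]
for pair in (("A", "A"), ("A", "B"), ("B", "A"), ("B", "B")):
    print("V_%s%s =" % pair, list(reversed(cross_collision(table, *pair, max_latency=4))))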
in a pipeline if all the latencies in the cycle are allowable. Our main concern so far
has been to find an allowable cycle which results in the MAL. However, an allow-
able cycle with the MAL does not necessarily imply 100 percent utilization of the
pipeline where utilization is measured as the percentage of time the busiest stage
remains busy. When a latency cycle results in a 100 percent utilization of at least
one of the pipeline stages, the periodic latency sequence is called a perfect cycle.
Of course, pipelines with perfect cycles can be better utilized than those with
nonperfect initiation cycles. It is trivial to note that constant cycles are all perfect.
Consider a latency cycle C. The set G_C of all possible time intervals between
initiations derived from cycle C is called an initiation interval set. For example,
G_C = {4, 8, 12, ...} for C = (4), and G_C = {2, 3, 5, 7, 9, 10, 12, 14, 15, 17, 19, 21,
22, 24, 26, ...} for C = (2, 3, 2, 5). Note that the interval is not restricted to two
adjacent initiations. Let G_C(mod p) be the set formed by taking the mod p equivalents
of all elements of the set G_C. For the cycle (2, 3, 2, 5) with period p = 12, the set
G_C(mod 12) = {0, 2, 3, 5, 7, 9, 10}. The complement set Ḡ_C equals Z − G_C, where
Z is the set of positive integers. Clearly, we have Ḡ_C(mod p) = Z_p −
G_C(mod p), where Z_p is the set of nonnegative integers modulo p. A latency cycle C
with a period p and an initiation interval set G_C is allowable in a pipeline with a
forbidden latency set F if, and only if,

F(mod p) ∩ G_C(mod p) = ∅          (3.19)

This means that there will be no collision if none of the initiation intervals
equals a forbidden latency. Thus, a constant cycle (l) with a period p = l is allowed
for a pipeline processor if, and only if, l does not divide any forbidden latency in the
set F. Another way of looking at the problem is to choose a reservation table whose
forbidden latency set F is a subset of the set Ḡ_C(mod p). Then the latency cycle C
will be an allowable sequence for the pipeline. For example, for the latency cycle C =
(2, 3, 2, 5), G_C(mod 12) = {0, 2, 3, 5, 7, 9, 10} and Ḡ_C(mod 12) = {1, 4, 6, 8, 11}, so
C can be applied to a pipeline with a forbidden latency set F equal to any subset
of {1, 4, 6, 8, 11}. This condition is very effective for checking the applicability (allow-
ability) of an initiation sequence (or a cycle) to a given pipeline; alternatively, one can modify
the reservation table of a pipeline to yield a forbidden list which is confined within
the set Ḡ_C(mod p), if the cycle C is fixed.
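Equation 3.19 is easy to test by computation. The following sketch (Python) forms G_C(mod p) for a latency cycle and intersects it with a forbidden latency set; the cycles and forbidden sets used below are the ones discussed in the text.

def initiation_intervals_mod_p(cycle):
    p = sum(cycle)
    times, t = [0], 0
    for k in cycle:                      # initiation instants within one period
        t += k
        times.append(t)
    # intervals between any two initiations in the periodic schedule, taken modulo p
    G = {(times[j] - times[i]) % p
         for i in range(len(times)) for j in range(len(times)) if i != j}
    G.add(0)                             # intervals that are exact multiples of p
    return p, G

def allowable(cycle, F):
    p, G = initiation_intervals_mod_p(cycle)
    return not ({f % p for f in F} & G)

cycle = (2, 3, 2, 5)
print(initiation_intervals_mod_p(cycle))     # (12, {0, 2, 3, 5, 7, 9, 10})
print(allowable(cycle, F={1, 4, 6, 8, 11}))  # True
print(allowable(cycle, F={1, 5, 6, 8}))      # False: 5 is a possible interval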
Adding noncompute stages to a pipeline can make it allowable for a given cycle.
The effect of delaying some computation steps can be seen from the reservation
table by writing a d before the step being delayed. Each d indicates one unit of
delay, called an elemental delay. It is assumed that all steps in a column must
complete before any steps in the next column are executed. In Figure 3.39a, the
step in row 0 and column 2 is delayed by two time units and the step in row 2 and
column 2 by one time unit; the effect is shown in Figure 3.39b. The elemental
delays require the use of additional delays to make all the outputs simultaneously
available in column 2 of the original reservation table.
For a given constant latency cycle (l), a pipeline can be made allowable by
delaying some of the steps if, and only if, there are no more than l marks in each
row of the table.
(Figure 3.39: (a) the original reservation table, with optimal cycle (4) and MAL = 4; (b) delaying parallel computation steps; (c) inserting delays to make the pipeline allowable for the optimal cycle (1, 5), with MAL = 3.)
Thus, by adding elemental delays, a unifunction pipeline can always
be fully utilized through the use of a cycle that has a constant latency equal to the
maximum number of marks in any row of the reservation table. The maximum
achievable throughput of that pipeline is thereby attained. On the other hand, for
an arbitrary cycle, a pipeline can be made allowable by delaying some steps. The
reservation table of Figure 3.39a can be made allowable with respect to the
optimal cycle (1, 5) by adding some elemental delays. The resulting table is shown
in Figure 3.39c.
(d) Insert one buffer for each computation step of a two-function pipeline
Figure 3.40 Inserting buffers to improve the pipeline utilization rate.
stages. This may cause a collision when one instruction, as a result of bypassing,
attempts to use the operands fetched for preceding instructions. To alleviate this
problem, one solution has each instruction activate a number of consecutive
stages down the pipeline which satisfy its need.
A dynamic pipeline would allow several configurations to be simultaneously
present. For example, a dynamic-pipeline arithmetic unit could perform addition
and multiplication at the same time. Tremendous control overhead and increased
interconnection complexity would be expected. None of the existing pipeline pro-
cessors has achieved this dynamic capability. Most commercial pipelines are
static. In TI-ASC, the desired control allows different instructions to assume dif-
ferent data paths through the arithmetic pipeline at different times. All path-control
information is stored in a read-only memory (ROM), which can be accessed at the
initiation of an instruction.
The configuration for floating-point addition in TI-ASC (Figure 3.28b)
requires four ROM words for its path-interconnection information. This forces
the instruction execution logic to access the ROM for control signals. The ROM
words for a floating-point add may be located at 100, 101, 102, and 103, while
the words for a floating-point subtract could be located at 200, 101, 102, and 103.
The common ROM words (101, 102, 103) used by both operations represent similar
suboperations contained in these two instructions. The starting ROM address is
supplied by the instruction-execution logic directly after the decode of the instruc-
tion.
The pipeline configuration for a floating-point vector dot product in TI-ASC
was depicted in Figure 3.28c. If the dot product operated upon 1000 operands, the
pipeline would be in this configuration for 1000 clock periods. Scalar instructions
in ASC use different control sequences. When several scalar instructions in a se-
quence are of a common type, the instructions streaming through the arithmetic
pipeline can be treated as vectors. This requires a careful selection of ROM output
signals to allow the maximum overlapping of instructions. The ability to overlap
instructions of the same type is achieved by studying the utilization of each pipeline
segment. Overlaying identical patterns gives the minimum number of clock
periods per result. The two static arithmetic pipeline processors in the STAR-100 are
reconfigurable with variable structures. Variable structure and resource sharing
are of central importance to designing multifunction pipelines. Systematic pro-
cedures are yet to be developed for designing dynamically reconfigurable pipelines.
In this section, we explain the basic concepts of vector processing and the necessary
implementation requirements. We distinguish vector processing from scalar pro-
cessing, present the characteristics of vector instructions, and define the perfor-
mance measures of vector processors. We present a parallel vector scheduling
model for multipipeline supercomputers. Three vector processing methods will
be introduced for pipeline computers. After examining the architectures of various
f1 : V → V
f2 : V → S          (3.20)
f3 : V × V → V
f4 : V × S → V
where V and S denote a vector operand and a scalar operand, respectively. The
mappings f1 and f2 are unary operations, and f3 and f4 are binary operations. As
shown in Table 3.5, the VSQR (vector square root) is an f1 operation, VSUM
(vector summation) is an f2 operation, SVP (scalar-vector product) is an f4 opera-
tion, and VADD (vector add) is an f3 operation. The dot product of two vectors,
V1 · V2 = Σ_{i=1}^{n} v_{1i} · v_{2i}, is generated by applying f3 (vector multiply) and then f2
(vector sum) operations in sequence. Listed in Table 3.5 are some representative
vector operations that can be found in a modern vector processor. Pipelined
implementation of the four basic vector operations is illustrated in Figure 3.41.
Note that a feedback connection is needed in the f2 operation.
(Figure 3.41: pipelined implementation of the four basic vector operations f1 through f4.)
operands. Several examples are shown below to characterize these special vector
operations.
Example 3.3 Let X = (2, 5, 8, 7) and Y = (9, 3, 6, 4). After the compare in-
struction B = X > Y is executed, the boolean vector B = (0, 1, 1, 1) is gener-
ated.
Let X = (1, 2, 3, 4, 5, 6, 7, 8) and B = (1, 0, 1, 0, 1, 0, 1, 0). After the execution
of the compress instruction Y = X(B), the compressed vector Y = (1, 3, 5, 7) is
generated.
Let X = (1, 2, 4, 8), Y = (3, 5, 6, 7), and B = (1, 1, 0, 1, 0, 0, 0, 1). After the
merge instruction Z = X, Y, (B), the result is Z = (1, 2, 3, 4, 5, 6, 7, 8). The
first 1 in B indicates that Z(1) is selected from the first element of X. Similarly,
the first 0 in B indicates that Z(3) is selected from the first element of Y.
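The compare, compress, and merge operations of Example 3.3 can be mimicked with ordinary lists, as in the sketch below (Python; a real vector processor would of course perform these element operations in pipelined hardware rather than in an interpreted loop).

def compare(X, Y):                    # B = X > Y
    return [1 if x > y else 0 for x, y in zip(X, Y)]

def compress(X, B):                   # Y = X(B): keep the elements where B is 1
    return [x for x, b in zip(X, B) if b == 1]

def merge(X, Y, B):                   # Z = X, Y, (B): take from X where B is 1, else from Y
    xs, ys, Z = iter(X), iter(Y), []
    for b in B:
        Z.append(next(xs) if b == 1 else next(ys))
    return Z

print(compare((2, 5, 8, 7), (9, 3, 6, 4)))                            # [0, 1, 1, 1]
print(compress((1, 2, 3, 4, 5, 6, 7, 8), (1, 0, 1, 0, 1, 0, 1, 0)))   # [1, 3, 5, 7]
print(merge((1, 2, 4, 8), (3, 5, 6, 7), (1, 1, 0, 1, 0, 0, 0, 1)))    # [1, 2, ..., 8]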
In general, machine operations suitable for pipelining should have the follow-
ing properties:

a. Identical processes (or functions) are repeatedly invoked many times, each
of which can be subdivided into subprocesses (or subfunctions).
b. Successive operands are fed through the pipeline segments and require as few
buffers and local controls as possible.
c. Operations executed by distinct pipelines should be able to share expensive
resources, such as memories and buses, in the system.
1. The operation code must be specified in order to select the functional unit or
to reconfigure a multifunctional unit to perform the specified operation.
Usually, microcode control is used to set up the required resources.
2. For a memory-reference instruction, the base addresses are needed for both
source operands and result vectors. If the operands and results are located in
the vector register file, the designated vector registers must be specified.
3. The address increment between the elements must be specified. Some computers,
like the Star-100, restrict the elements to be consecutively stored in the main
memory, i.e., the increment is always 1. Some other computers, like the TI-ASC,
can have a variable increment, which offers higher flexibility in application.
4. The address offset relative to the base address should be specified. Using the
base address and the offset, the effective memory address can be calculated.
The offset, either positive or negative, offers the use of skewed vectors to
achieve parallel accesses.
5. The vector length is needed to determine the termination of a vector instruction.
A masking vector may be used to mask off some of the elements without
changing the contents of the original vectors.
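The fields listed above can be thought of as one descriptor per vector instruction. The following sketch (Python; the class and field names are purely illustrative and do not correspond to any particular machine's instruction format) simply bundles them together.

from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class VectorInstruction:
    opcode: str                  # operation code, e.g. "VADD"
    base_src1: int               # base address of the first source vector
    base_src2: Optional[int]     # base address of the second source (None for unary ops)
    base_dst: int                # base address of the result vector
    increment: int = 1           # address increment between elements (1 = contiguous)
    offset: int = 0              # offset relative to the base address
    length: int = 0              # vector length, used to terminate the instruction
    mask: Optional[Sequence[int]] = None    # masking vector, if some elements are skipped

vadd = VectorInstruction(opcode="VADD", base_src1=0x1000, base_src2=0x2000,
                         base_dst=0x3000, increment=1, length=64)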
TEMP(1:N) = A(2:N+1)
A(1:N) = B(1:N) + C(1:N)
B(1:N) = 2*TEMP(1:N)

where A(1:N) refers to the N-element vector A(1), A(2), ..., A(N). The
introduction of the TEMP(1:N) vector is necessary to enable the vectoriza-
tion.
The execution of the scalar loop repeats the loop-control overhead in each
iteration. In vector processing using pipelines, the overhead is reduced by using
hardware or firmware controls. A vector-length register can be used to control
the vector operations. The overhead of pipeline processing is mainly the setup time,
which is needed to route the operands among functional units. For example, in the
ASC and Star-100 systems, each vector instruction needs to get some vector-
parameter registers or control vectors before the instruction can be initiated. Thus,
many additional memory fetches are needed to load the control registers. Another
overhead is the flushing time between the decoding of a vector instruction and the
exit of the first result from the pipeline. The flushing time exists for both vector and
scalar processing; however, a vector pipe has to check the termination condition
and the control vectors. Therefore, a vector pipe may have a longer flush time than
its sequential counterpart.
The vector length affects the processing efficiency because of the additional
overhead caused by subdividing a long vector. In order to enhance the vector-
processing capability, an optimized object code must be produced to maxi-
mize the utilization of pipeline resources. The following approaches have been
suggested:
Enrich the vector instruction set With a richer instruction set, the processing
capability will be enhanced. One can avoid excessive memory accesses and poor
resource utilization with an improved instruction set. The compress instruction in
Example 3.3 was a good example of saving memory.
Combine scalar instructions Using a pipeline for processing scalar quantities, one
should group scalar instructions of the same type together as a batch instead of
interleaving them. The overhead due to the pipeline reconfiguration can be greatly
reduced by grouping scalar instructions.
(Figure: the scalar processor and the vector processor share a high-speed main memory; the vector processor contains a vector instruction controller, a vector access controller, vector registers, and functional pipes 1 through m.)
Figure 3.43 The architecture of a typical vector processor with multiple functional pipes.
2. < is a partial ordering relation specifying the precedence relationship among
the tasks in the set T.
3. t: T → R+ is a time function defining the production delay t(T_i) for each T_i in
T. We shall denote the value t(T_i) simply as t_i for all i = 1, 2, ..., n.

Let P = {P_1, P_2, ..., P_m} be the set of vector pipelines and R+ be the set of
possible time intervals. The utilization of a pipeline P_i within an interval [x, y] is
denoted by P_i(x, y). The set of all possible pipeline-utilization patterns is called the
resource space, which is equal to the cartesian product P × R+ = {P_i(x, y) | P_i ∈ P
and (x, y) ∈ R+}. A parallel schedule f for a vector task system V = (T, <, t) is a
total function defined by

f : T → 2^(P × R+)          (3.22)

where 2^(P × R+) is the power set of the resource space P × R+. Typically, we have the
following mapping for each T_i ∈ T:

f(T_i) = {P_i1(x_1, y_1), P_i2(x_2, y_2), ..., P_ip(x_p, y_p)}

where the index i_j ∈ {1, 2, ..., m} could be repeated.
This mapping actually subdivides the task T_i into p subtasks T_i1, T_i2, ..., T_ip. Sub-
task T_ij will be executed by pipeline P_ij for each j = 1, 2, ..., p. We call
{T_ij | j = 1, 2, ..., p} a partition of the task T_i. The following conditions must be
met in order to facilitate multiple-pipeline operation:

1. For all intervals [x_j, y_j], j = 1, 2, ..., p, y_j − x_j > t_0, and the total production
delay t_i = Σ_{j=1}^{p} (y_j − x_j − t_0).
2. If P_ij = P_ik, then [x_j, y_j] ∩ [x_k, y_k] = ∅. This implies that each pipeline is
static, performing only one subtask at a time.
The finish time for vector task T_i is F(T_i) = max{y_1, y_2, ..., y_p}. The finish
time of a parallel schedule for an n-task system is defined by

ω = max{F(T_i) | i = 1, 2, ..., n}          (3.24)

The purpose is to find a "good" parallel schedule such that ω can be minimized.
This deterministic scheduling concept is clarified by the following example:
T_31 and T_32, with t_31 = 4 and t_32 = 2. The parallel schedule f is specified by
the following mappings with a finish time ω = F(T_3) = 14:

f(T_1) = {P_1(0, 8), P_2(3, 7)}        with S(T_1) = 0 and F(T_1) = 8
f(T_2) = {P_2(8, 11)}                  with S(T_2) = 8 and F(T_2) = 11
f(T_3) = {P_1(8, 13), P_2(11, 14)}     with S(T_3) = 8 and F(T_3) = 14
f(T_4) = {P_2(0, 3)}                   with S(T_4) = 0 and F(T_4) = 3
The multiple-pipeline scheduling problem can be formally stated as a feasi-
bility problem: given a vector task system V, a vector computer with m identical
pipelines, and a deadline D, does there exist a parallel schedule f for V with finish
time ω such that ω ≤ D? This scheduling problem has been proven to be compu-
tationally intractable. In practice, the production delays of different vector
tasks are different. These unequal production delays lead to the intractability of the
multipipeline scheduling problem. Therefore, we have to seek heuristic algorithms
in real-life system designs. The heuristics must be simple to implement, with low
system overhead and nearly optimal performance.
Consider a vector processor with m pipelines and a fixed overhead time t_0
for all instructions. The input to the scheduler is an independent task system V
with n vector tasks which are totally unrelated. The task scheduler is a built-in part
of the vector instruction controller. The output is the parallel schedule f for V.
Let t_j be the time span of using pipeline P_j for the execution of various tasks in a
given task system V. This time span includes the overhead time t_0 every time the
pipeline is reconfigured to assume a new task (or a new subtask), the production
times of the tasks (or subtasks) assigned to P_j, and some idle times between successive tasks.
The average time span t_a(k) for partitioning n tasks into n + k subtasks over m
pipelines is defined by

t_a(k) = [ Σ_{i=1}^{n} (t_i + t_0) + k·t_0 ] / m          (3.26)

If there is no subdivision of the original tasks in a schedule, the average time span
t_a(k) reduces, for k = 0, to

t_a(0) = [ Σ_{i=1}^{n} (t_i + t_0) ] / m          (3.27)

This quantity t_a(0) is an absolute lower bound of the finish time ω defined in
Eq. 3.24. This means that an optimal schedule is generated when ω = t_a(0).
Scheduling independent tasks among the m pipelines is done by making the time
spans t_j (for j = 1, 2, ..., m) as close to t_a(k) as possible. As demonstrated in Figure
3.45, a bin-packing approach is used to generate a parallel schedule for inde-
pendent tasks. First, we assign some tasks to pipeline P_1 until the time t_1 ≥ t_a(0) − t_0/2.
Then we switch to pipeline P_2 for assigning the remaining tasks until t_2 ≥ t_a(k) −
t_0/2, where k = 0 or 1 depending on how many subdivisions of tasks have been
performed.
Note: shaded areas correspond to pipelines that have been assigned vector tasks.
Figure 3.45 Multipipeline scheduling for independent vector tasks with a bin-packing approach.
This generating process will repeat in a sequential manner for the re-
maining pipelines.
In general, we will switch to the next pipeline, P_{j+1}, when the following
boundary condition is met:

t_j ≥ t_a(k) − t_0/2          (3.28)

Furthermore, we will subdivide the current task and update t_a(k) if the following
condition is met, before switching to pipeline P_{j+1}:

t_j ≤ t_a(k) + t_0/2          (3.29)
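The bin-packing heuristic of Eqs. 3.26 through 3.29 can be sketched as follows (Python). The task delays, the overhead t0, and the number of pipelines are assumed values (here the production delays of Example 3.6 below are simply treated as independent tasks), and the subdivision rule is a rough approximation of the procedure of Figure 3.45 rather than an exact transcription.

def bin_pack(delays, m, t0):
    k = 0                                                       # extra subtasks created so far
    ta = lambda k: (sum(d + t0 for d in delays) + k * t0) / m   # Eq. 3.26
    schedule = [[] for _ in range(m)]                           # (task index, assigned delay)
    spans = [0.0] * m
    j = 0
    for i, d in enumerate(delays):
        while d > 0 and j < m:
            room = ta(k) + t0 / 2 - spans[j]            # what still fits on pipeline j
            if d + t0 <= room or j == m - 1:            # whole remainder fits (or last pipe)
                spans[j] += d + t0
                schedule[j].append((i, d))
                d = 0
            else:                                       # subdivide the task (cf. Eq. 3.29)
                part = max(room - t0, 0.0)
                if part > 0:
                    spans[j] += part + t0
                    schedule[j].append((i, part))
                    d -= part
                    k += 1                              # one more subtask, so ta(k) grows
            if spans[j] >= ta(k) - t0 / 2:              # boundary condition, Eq. 3.28
                j += 1
    return schedule, spans

schedule, spans = bin_pack(delays=[2, 4, 6, 8, 8, 2, 6, 4, 4], m=4, t0=1)
for j, (tasks, span) in enumerate(zip(schedule, spans), start=1):
    print("P%d:" % j, tasks, "span = %.2f" % span)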
We consider below, as an example, the schedule of a tree-structured task
system based on the partitioned bin-packing procedure. This procedure generates a
partition, {E_1, E_2, ..., E_l}, of all tasks in the tree system. The first block E_1 consists
of all tasks on leaf nodes. The second block E_2 consists of those tasks on the
"new" leaf nodes after removing the tasks in E_1 from the tree. This process continues
until reaching the root, which forms the last block E_l, where l equals the tree height.
We shall process tasks in E_i before E_j if i < j. In this sense, each E_i can be consid-
ered a set of independent tasks, which can be dispatched concurrently as
described above.
Example 3.6 We are given a tree task system V = (T, <, t), where T =
{T_1, ..., T_9} follows the tree relationship shown in Figure 3.46a. Suppose
t_0 = 1, t_1 = 2, t_2 = 4, t_3 = 6, t_4 = 8, t_5 = 8, t_6 = 2, t_7 = 6, t_8 = 4, and t_9 =
4, as marked in the tree graph. To schedule this tree task system on m = 4
identical pipelines, we first obtain the partition E_1 = {T_1, T_2, T_3, T_4}, E_2 =
{T_5, T_6, T_8}, E_3 = {T_7}, E_4 = {T_9}, as circled by dashed lines in the figure. A
parallel schedule f_a is generated, as depicted in Figure 3.46b. Shaded areas
indicate the idle periods of pipelines. Tasks T_2, T_3, T_4, T_5, T_7, and T_9 have been
subdivided into subtasks:

f_a(T_1) = {P_1(0, 3)}
f_a(T_2) = {P_1(3, 6.5), P_2(0, 2.5)}
f_a(T_3) = {P_2(2.5, 6.75), P_3(0, 3.75)}
f_a(T_4) = {P_3(3.75, 7), P_4(0, 6.75)}
f_a(T_5) = {P_1(7, 11.75), P_2(7, 12), P_3(7, 8.25)}
f_a(T_6) = {P_3(8.25, 11.25)}
f_a(T_8) = {P_4(7, 12)}
f_a(T_7) = {P_i(12, 14.5) : 1 ≤ i ≤ 4}
f_a(T_9) = {P_i(14.5, 16.5) : 1 ≤ i ≤ 4}

The finish time ω = 16.5 has the same order of magnitude as ω_0 = 13.25, the
finish time of an optimal schedule.
y_i = z_i1 + z_i2 + ··· + z_in,     i = 1, 2, ..., m
We will compare the total execution times of the various pipeline processing methods
with T_u in order to reveal their relative speedups.
The way that addition pairs (operands) are scheduled distinguishes the three
processing methods. In what follows, we assume that all multiplications have
already been carried out by the pipeline in T_m pipeline cycles:

T_m = 5τ (setup time) + 5τ (time for the first product to come out of the pipe)
        + (mn − 1)τ (time for producing all the remaining products)
    = (mn + 9)τ          (3.33)

It is assumed that the main memory is large enough to hold all intermediate
results. There is a feedback path from the output of the pipeline to one of the
two inputs, if needed for cumulative additions. Let T_a be the total number of clock
periods needed for the pipelined addition in each of the following methods. It is
assumed that m > 5 and n > 5.
Horizontal vector processing In this method, all components of the vector y are
calculated in a sequential order, y_i for i = 1, 2, ..., m, as in the scalar loop

      DO 100 I = 1, M, 1
        DO 100 J = 1, N, 1
          Y(I) = Y(I) + A(I,J)*X(J)
  100 CONTINUE

Each summation y_i = Σ_{j=1}^{n} z_ij, involving (n − 1) additions, must be completed
before switching to the evaluation of the next summation y_{i+1} = Σ_{j=1}^{n} z_{i+1,j}. To evaluate each y_i
requires (n + 14) clock periods. The total add time for m outputs equals

T_a = m(n + 14)τ          (3.34)

The relative speedup of this method is therefore
S_horizontal = T_u / T_p(horizontal) = (10mn − 5m) / (2mn + 14m + 9)          (3.35)
Vertical vector processing The sequence of additions in this method is specified
below with respect to the m-by-n array shown in Eq. 3.31:

Step 1. Compute the partial sums z_i1 + z_i2 = y_i1 for i = 1, 2, ..., m sequentially
through the pipeline.
Step 2. Compute the partial sums y_i1 + z_i3 for i = 1, 2, ..., m by loading y_i1
into one input port of stage 1 and loading z_i3 into the second input port.
Step 1. Apply the vertical processing method to generate the first block of five
outputs, y_1, y_2, ..., y_5, in column fashion.
Step 2 to Step k. Repeat Step 1 to generate the remaining five-output blocks, as
listed below:

Step 2: y_6, y_7, ..., y_10

The total add time of this vector-looping method is given below, where
k = (m − 5)/5.
memory-to-memory pipeline operations, like those in the Star-100 and the Cyber-
205. The vector-looping method is also not restricted by the vector length. Since the
intermediate results appear as small blocks of data, one can use a cache memory
or fast register arrays to hold the intermediate results. Thus vector looping is more
suitable for register-to-register pipeline operations, such as in the Cray-1 and the
Fujitsu VP-200. It is interesting to note that all the speedups approach 5, the
number of stages in the sample pipeline, when n and m are very large in the per-
formance analysis.
The speed of a scalar processor is usually measured by the number of instruc-
tions executed per unit time, such as a million instructions per second
(MIPS). For a vector processor, it is customary to measure
the number of arithmetic operations performed per unit time, such as
mega floating-point operations per second (megaflops). Note that the conversion
between mips and megaflops depends on the machine type; there is no fixed
relationship between the two measures. In general, performing a floating-point
operation in a scalar processor may require two to five instructions. If we consider
the average to be three, then one megaflops may imply three mips. This conversion
constant is machine dependent. Other authors compare the speeds of different
computers by choosing a reference machine. Readers should be aware of the
difference between the peak speed and the average speed when benchmark programs
or test computations are executed on each machine. The peak speed corresponds
to the maximum theoretical CPU rate, whereas the average speed is determined
by the processing times of a large number of mixed jobs including both CPU
and I/O operations.
using noncompute delays and internal buffers were proposed by Patel (1976,
1978a, b). Lookahead techniques such as hazard resolution and data forwarding
have been treated by Keller (1975) and Tomasulo (1967). The modeling of a vector
processor with multiple pipelines is based on the work of Hwang and Su (1983).
Static pipes are favored in commercial designs because of their lower control and hardware
costs. However, systems requiring reliable and flexible designs may have to use
dynamic pipes in order to enhance the fault-tolerance capability and to increase
resource utilization.
Problems
3.1 Describe the following terminologies associated with pipeline computers and vector processing:
(a) Static pipeline                  (k) Minimum average latency
(b) Dynamic pipeline                 (l) Precise vs. imprecise interrupts
(c) Unifunctional pipeline           (m) Perfect cycle
(d) Multifunctional pipeline         (n) Greedy cycle
(e) Instruction pipeline             (o) Data-dependent hazards
(f) Arithmetic pipeline              (p) Short circuiting
(g) Pipeline efficiency              (q) Internal forwarding
(h) Pipeline throughput              (r) Vectorizer
(i) Forbidden latencies              (s) Branch target buffering
(j) Collision vector                 (t) Register tagging
3.2 Compare the advantages and disadvantages of the three interleaved memory organizations,
the S-access, the C-access, and the C/S-access, described in Section 3.1.4 for pipelined vector
accessing. In the comparison, you should be concerned with the issues of effective memory bandwidth,
storage schemes used, access conflict resolution, and cost-effectiveness tradeoffs.
3.3 Consider a four-segment normalized floating-point adder with a 10-ns delay per segment,
which equals the pipeline clock period.
(a) Name the appropriate functions to be performed by the four segments.
(b) Find the minimum number of periods required to add 100 floating-point numbers A_1 +
A_2 + ··· + A_100 using this pipeline adder, assuming that the output Z of segment S_4 can be routed
back to either of the two inputs X or Y of the pipeline with delays equal to any multiple of the period.

X, Y → S1 → S2 → S3 → S4 → Z
3.4 A certain dynamic pipeline with the four segments S_1, S_2, S_3, and S_4 is characterized by the
following reservation table:
(a) Determine the latencies in the forbidden list F and the collision vector C.
(b) Determine the minimum constant latency L by checking the forbidden list.
(c) Draw the state diagram for this pipeline. Determine the minimal average latency (MAL) and
the maximum throughput of this pipeline.
3.5 For the following reservation table of a pipeline processor, give the forbidden list of
latencies F, the lower bound on latency, the collision vector, the state diagram, the MAL, and all
greedy cycles.
ao fr 2 73 "4
Sj Al B AIB
S; BIA
5,)B B
(a) List all four cross forbidden lists of latencies and the corresponding combined cross-collision
matrices.
(b) Draw the state diagram for the two-functional pipeline.
3.7 Assume that instructions are executed in a k-segment pipeline. The delay of each segment is one
time unit. If an instruction depends on one or more of its predecessors, then all these predecessors must
complete execution before the current instruction can begin execution. If such a predecessor is N
instructions ahead of the current instruction, a delay of k − N time units is added for N < k and
no delay for N ≥ k. Let p_n be the probability of encountering a data dependency from the nth pre-
decessor. Assume an integer L > k. Suppose that p_n has the distribution p_n = 1/L for n = 1, 2, ..., L
and p_n equals zero otherwise.
(a) Find the expected value of the total time T to execute a block of M instructions.
(b) Determine the performance P of the instruction pipeline, where

P = lim (M→∞) M / T
3.8 (a) Suppose that only two 4-segment pipelined adders and a number of noncompute delay
elements are available. The delay of each segment is one time unit, and a noncompute delay element
can have either a one- or a two-time-unit delay. Using the available resources, construct a pipeline with only
one input, a(i), to compute b(i) = a(i) + a(i − 1) + a(i − 2) + a(i − 3). Show the schematic block
diagram of your design.
(b) Given one additional four-segment pipelined adder, use this adder together with the pipeline
obtained from (a) to design a pipeline for computing the recurrence function x(i) = a(i) + x(i − 1).
The pipeline constructed should have a feedback. Show your schematic block diagram. Hint: x(i) =
a(i) + x(i − 1) = a(i) + [a(i − 1) + x(i − 2)] = a(i) + a(i − 1) + [a(i − 2) + x(i − 3)] = a(i) +
a(i − 1) + a(i − 2) + [a(i − 3) + x(i − 4)] = b(i) + x(i − 4).
3.9 Consider the following pipelined processor with four stages. All successor stages after each stage
must be used in successive clock periods.
Answer the following questions associated with using this pipeline with an evaluation time of six
pipeline clock periods.
(a) Write out the reservation table for this pipeline with six columns and four rows.
(b) List the set of forbidden latencies between task initiations.
(c) Show the initial collision vector.
(d) Draw the state diagram which shows all the possible latency cycles.
(e) List all the simple cycles from the state diagram.
(f) List all the greedy cycles from the state diagram.
(g) What is the value of the minimal average latency (MAL)?
(h) Indicate the minimum constant latency cycle for this pipeline.
(i) What is the maximal throughput of this pipeline?
3.10 (a) How does the IBM 360/91 avoid problems due to data dependencies involving the contents of
floating-point registers within the floating-point execution unit? In your answer, especially address
each type of hazard, indicating how each is controlled.
(b) The floating-point execution unit in the 360/91 handles data dependencies involving floating-
point register contents. What data dependencies can arise in the execution of floating-point instructions
(including loads to and stores from the floating-point registers) that involve the contents of some
memory word? How can these dependencies be managed? Efficiency is a prime consideration. Use a
block diagram to illustrate the organization of the major hardware units that your solution requires.
Explain the operation of each of these units.
3.11 Answer the following questions related to the task initiation cycle (2, 3, 7) for a given pipelined
processor.
(a) What are the period p and the average latency l_a of this initiation cycle?
(b) Specify the initiation interval set G_C(mod p).
(c) What is the necessary and sufficient condition that a given task initiation cycle is allowed by a
pipeline with a forbidden latency set F? Repeat the same question for a constant initiation cycle with
period p.
3.12 Suppose that scalar operations take 10 times longer to execute per result than vector operations.
Given a program which is originally written in scalar code:
(a) What percentage of the code needs to be vectorized in order to achieve speedup
factors of 2, 4, and 6, respectively?
(b) Suppose the program contains 15% of code that cannot be vectorized, such as sequential I/O
operations. Now repeat question (a) for the remaining code to achieve the three speedup factors.
CHAPTER
FOUR
PIPELINE COMPUTERS AND VECTORIZATION
METHODS
This chapter describes the system architectures and vector processing techniques
developed with existing pipeline computers. The first section gives a historical
retrospective of pipeline computers in two architectural categories: vector super-
computers and attached array processors. We will examine three attached pro-
cessors: the AP-120B (FPS-164), the IBM 3838, and the MATP. Vector super-
computers to be studied include the early systems Star-100 and TI-ASC, and the
recent systems Cray-1, Cyber-205, and VP-200, and their possible extensions.
Finally, we will study vectorizing compiling techniques, optimization methods,
and performance evaluation issues in designing or using pipeline computers.
Pipeline computers refer to those digital machines that provide overlapped data
processing in the central processor, in the I/O processor, and in the memory
hierarchy. Pipelining is practiced not only in program execution but also in
program loading and data fetching operations. The Univac-1 was the first machine
that overlapped program execution with some I/O activities. With the develop-
ment of interleaved memory, memory words in successive memory modules could
be fetched in a pipelined fashion. These pipelined memory fetches prompted the
overlapped instruction fetches and instruction executions pioneered in the
IBM 7094 series, in the Stretch project, and in the Univac-Larc system.
The performance of a pipeline processor may be significantly degraded by the
data dependency holdup problem. The evolution of the CDC 6000/7000 series
has contributed to the development of hardware/software mechanisms to overcome
this difficulty. In addition to further partitioning the instruction execution process,
the CDC 6000 series uses a status scoreboard to indicate the availability of the
various resources in the computer required to execute various stages of subsequent
host and the back-end machine. In this sense, the supercomputers Cray-1 and
Cyber-205 are also back-end machines driven by a host machine.
The projected speed performances of the aforementioned pipeline super-
computers and attached processors are compared in Figure 4.1. We use the measure
million operations per second (mops) to refer to either megaflops or a million integer
operations per second. All speeds indicated within the parentheses refer to the
theoretical peak performance, if the machine is used sensibly. In the late sixties,
only pipeline scalar processors were available, with a maximum speed of 5 mops, as
represented by the IBM 360/91 and 370/195 series and by the CDC 6600/7600
series. The first-generation vector processors Star-100 and TI-ASC have a speed
ranging from 30 to 50 mops. The second-generation vector processors Cray-1,
Cyber-205, and VP-200 have a speed between 100 and 800 mops.
The attached processors AP-120B and FPS-164 have a peak speed of 12
megaflops. The FACOM 230/75 can perform 22 megaflops. The MATP is a four-
pipeline multiprocessor which can operate at up to 120 megaflops. The IBM 3838
has a peak speed of 30 megaflops. The Fujitsu VP-200 is extended from its pre-
decessor, the FACOM 230/75. The first Cray X-MP became available in 1983.
Figure 4.1 The theoretical peak performance of pipelined supercomputers and attached scientific pro-
cessors, plotted in million operations per second (MOPS) against year of introduction (1965 to 1985).
The plot spans the scalar processors (the 7600 and 360/195), the first-generation vector processors
(Star-100 and TI-ASC), the second-generation vector processors (Cray-1, Cray X-MP, Cyber-205, and
VP-200), the attached processors (AP-120B, FPS-164, FACOM 230/75, IBM 3838, and MATP), and
the projected Cray-2 and NASF machines.
The future supercomputers Cyber 2xx, Cray 2, and S-1 are expected to perform
over 3000 megaflops for applications in the 1990s.
Of course, the peak speeds in Figure 4.1 may not always be attainable. For
average programs written by nonspecialists, the speed is much lower than those
peak values indicated. For mixed programs, it has been estimated that the average
rate of the Cray-1 is 24 megaflops, of the Star-100 is 16 megaflops, and of the
AP-120B is only 6 megaflops. These measured operating speeds are low because
the software is not properly tuned to exploit the hardware. These issues will be
studied in Section 4.5 along with language, compiling, vectorization, and optimiza-
tion facilities for vector processors.
Dedicated computers for vector processing started with the introduction of two
supercomputer systems known as the CDC-Star and the TI-ASC, both of which
have multiple pipeline processors for stand-alone operations. Based on the em-
ployed technology and architectural features, the two generations of vector
processors differ in many aspects. In this section, architectural structures, pipelined
arithmetic designs and vector processing in the Star and ASC are described. In the
next section, we will study more recent vector processors.
Figure 4.2 The system architecture of the Star-100: the storage access control (SAC) unit and the 32
memory banks, the stream unit with its read and write buffers and fan-in/fan-out paths, the string unit,
the two floating-point pipeline processors, and the I/O channels with direct memory access. (Courtesy
of Control Data Corp.)
operand streams to the pipeline processors. The third bus is used for storing the
result stream, and the fourth bus is shared between input-output storage requests
and the references of control vectors.
The Storage Access Control unit controls the transmission of all data to and
from the memory. It is responsible for memory sharing among the various buses
shared by the stream and I/O units. Its principal function is to perform virtual
memory address comparison and translation. The Stream unit provides basic
control for the system. All memory references and many control signals
originate from this unit. It has the facilities for instruction buffering and decoding.
The Read Buffer and Write Buffer are used to synchronize the four active buses to
maintain a smooth data transfer. The memory requests are buffered eight banks
apart to avoid access conflicts. As a result, the maximum pipeline rate can be
sustained regardless of the distribution of addresses on the four active buses.
Other functional units in the Stream unit include the register file and the micro-
code memory. The register file supplies necessary addressing for all source operands
and results. It also has the capability of performing simple logical and arithmetic
operations. The semiconductor microcode memory is used as part of the stream
control. The control signals and enable conditions produced by the microcode are
used together with the hardwired control to process instructions and interrupts.
The String unit processes strings of decimal or binary digits and performs bit-
logical and character-string operations. It contains several adders to execute
binary coded decimal (BCD) and binary arithmetic.
In the Star-100 there are two independent arithmetic pipelines (Figure 4.3). Pipe-
line processor 1 consists of a 64-bit floating-point (henceforth FLP) add unit and a
32-bit FLP multiply unit. The add pipeline on the right contains four segment
groups in cascade. The exponent compare segment compares exponents and saves
the larger. The difference between the exponents is then saved as a shift count by
which the fraction with the smaller exponent is right-shifted in the coefficient
alignment segment. In the add segment, the shifted and unshifted fractions are
added. The sum and the larger exponent are then gated to the normalized segments.
The transmit segment selects the desired upper or lower half of the sum, checks for
any fraction overflow, and transmits the results to the designated data bus. There
is a path from the output of the transmit segment to the input of the receive seg-
ment. This feedback feature is especially useful for continuous addition of multiple
floating-point numbers. However, when nonstreaming-type operations are
performed, the execution time can be decreased by 50 percent if the output of an
operation is needed as an input operand for subsequent operations.
With little additional hardware, it is possible to split the 64-bit add pipeline
into two independent 32-bit ones. Consequently, half-width (32-bit) arithmetic
can be available. The 32-bit multiply pipeline is implemented with multiplier-
recoding logic, multiplicand-gating network, and several levels of carry-save
adders. A resultant product of the multiplication is formed by adding the final
partial sum and the saved carry vector. The required post-normalization after
FLP multiply is done using the normalize segments of the add pipeline on the
right.
Processor number 2, depicted in Figure 4.3b, contains a pipelined add unit, a
nonpipelined divide unit, a pipelined multipurpose unit, and some pipelined merge
units. The add pipeline in processor number 2 is similar to that in processor
number 1. The multipurpose pipeline has 24 segments and is capable of performing
multiply, divide, square root, and a number of other arithmetic logic operations.
The register divide unit is a nonpipelined divider which can also perform BCD
arithmetic.
Two 32-bit multiply pipelines can be combined to form a 64-bit multiply
pipeline. This combined unit can simultaneously execute two 32-bit multiplications
or execute one 64-bit multiplication. In order to perform a 64-bit multiplication,
the multiplicand A and multiplier B are each split into two parts, A = A0 + A1·2^w
Figure 4.3 Arithmetic pipelines in the Star-100: (a) the processor 1 stream, with its receive, exponent
compare, coefficient alignment, add, normalize, and transmit segments and the 32-bit multiply pipeline;
(b) the processor 2 stream, with its add pipeline, the 24-segment multipurpose pipeline, the register
divide unit, and the merge units. (Courtesy of Control Data Corp.)
and B = By + B,- 2", where w = 32 bits, the width of the basic multiply pipeline.
Then the following four multiplications are performed:
A x B= Ag x By + (Ay X By + A, X Bo}
2" + (Ar x By) -2" (4.1)
Ao x By and Ap x B, are executed during the first cycle of multiplication,
and A, x By and A, x B, during the second cycle. Afterward, all partial sums
and partial carries are merged in a 64-bit merge section, which is essentially a set
of carry-save adder trees (pipelines). The partial sum and partial product from the
64-bit merge section are then added together by two adders to yield the final 64-bit
product.
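As a rough software analogue of Eq. (4.1), the following C sketch forms a full 64 × 64-bit product
from four 32-bit partial products and a merge step; the function name and the use of ordinary C
additions in place of the carry-save merge tree are illustrative choices, not part of the Star-100 design.

    #include <stdint.h>
    #include <stdio.h>

    /* 64-bit multiply built from four 32-bit multiplies, as in Eq. (4.1). */
    void mul64_from_32(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
    {
        const unsigned w = 32;
        uint64_t a0 = a & 0xFFFFFFFFu, a1 = a >> w;   /* A = A0 + A1*2^w */
        uint64_t b0 = b & 0xFFFFFFFFu, b1 = b >> w;   /* B = B0 + B1*2^w */

        uint64_t p00 = a0 * b0;        /* first cycle:  A0*B0 and A0*B1 */
        uint64_t p01 = a0 * b1;
        uint64_t p10 = a1 * b0;        /* second cycle: A1*B0 and A1*B1 */
        uint64_t p11 = a1 * b1;

        /* Merge: A*B = p00 + (p01 + p10)*2^w + p11*2^2w */
        uint64_t mid = p01 + (p00 >> w) + (p10 & 0xFFFFFFFFu);
        *lo = (p00 & 0xFFFFFFFFu) | (mid << w);
        *hi = p11 + (p10 >> w) + (mid >> w);
    }

    int main(void)
    {
        uint64_t hi, lo;
        mul64_from_32(0x0123456789ABCDEFull, 0x0FEDCBA987654321ull, &hi, &lo);
        printf("%016llx %016llx\n", (unsigned long long)hi, (unsigned long long)lo);
        return 0;
    }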
The Star-100 has 130 scalar instructions and 65 vector instructions, as cate-
gorically listed in Table 4.1. Vectors in the Star-100 are formed as strings of binary
numbers or characters, or as arrays of 32- or 64-bit FLP numbers. The sparse
vector instructions can process compressed sparse vectors. When the pipeline
enters streaming operations, it is possible to maintain a 40-ns output rate. The
input-to-output time for the FLP add pipeline is 160 ns, because there are
essentially four pipeline segments. The time delay of the FLP multiply pipeline
equals 320 ns. The maximum throughput for different arithmetic operations is
summarized in Table 4.2. These are peak CPU speeds. In practice, the measured
average speed of the Star-100 is only 0.5 to 1.5 megaflops for scalar operations and
5 to 10 megaflops for vector operations, lower than its designed capabilities. It is
quite obvious that double-precision FLP operations require more time to com-
plete, twice the add/subtract time and four times the multiply or divide time
compared to their single-precision counterparts.
The Texas Instruments Advanced Scientific Computer (ASC) was delivered in 1972.
The central processor of the ASC incorporates a high degree of pipelining at
both the instruction and arithmetic levels. The basic components of the ASC system
Table 4.2 Maximum throughput for different arithmetic operations in the Star-100 (mops)

    Floating-point     32-bit     64-bit
    operations         (short)    (long)

    Add-subtract         100        50
    Multiply             100        25
    Divide                50        12.5
    Square root           50        12.5
are shown in Figure 4.4. The central processor is used for its high speed to process
a large array of data. The peripheral processing unit is used by the operating
system. Disk channels and tape channels support a large number of storage units.
Data concentrators are included for support of remote batch and interactive
terminals. The memory banks and an optional memory extension are managed
by the memory control unit. The main memory has eight interleaved modules,
each with a cycle time of 160 ns and a word length of 32 bits. Eight memory words
can be transferred in one memory access. The memory control unit is an interface
between eight independent processor ports and nine memory buses. Each processor
port has full accessibility to all memories.
The central processor can execute both scalar and vector instructions.
Figure 4.5 illustrates the functional pipelines in the central processor. The pro-
cessor includes the instruction processing unit (IPU), the memory buffer unit
Figure 4.4 Basic Texas Instruments ASC system configuration: the central processor, the peripheral
processing unit, the memory control unit with its memory banks and optional memory extension,
and the disk channels, tape channels, and data concentrators for remote batch and interactive
terminals.
Figure 4.5 Central processor of the TI-ASC with four arithmetic logic pipelines: the instruction
processing unit with its base and vector parameter registers, the memory ports, and the MBU-AU pairs
carrying the instruction and operand streams. (Courtesy of Texas Instruments, Inc.)
(MBU), and the arithmetic unit (AU). Up to four arithmetic pipelines (MBU-
AU pairs) can be built into the central processor. The ASC instruction types are
listed in Table 4.3. The maximum ASC speed per arithmetic pipeline is given in
Table 4.4. On the average, only 0.5 to 1.5 megaflops and 3 to 10 megaflops per
pipeline can be expected for scalar and vector operations, respectively.
The primary function of the IPU is to supply the rest of the central processor
with a continuous stream of instructions. Internally, the IPU is a multisegment
pipeline which has 48 program-addressable registers for fetching and decoding
instructions and generating the operand addresses. Instructions are first fetched in
octets (8 words) from memory into the instruction buffers of 16 registers. Then
the IPU assigns instructions to the MBU-AU pairs to achieve
optimal use of the arithmetic pipelines. The MBU is an interface between main
memory and the arithmetic pipelines. Its primary function is to support the
arithmetic units with continuous streams of operands. The MBU has three double
buffers, with each buffer having eight registers. The X and Y buffers are used for inputs,
and the Z buffer is used for output. The fetch and store of data are made in 8-word
(Instruction-format figure: the Star-100 vector instruction contains the fields F (opcode, 8X or 9X),
G (subfunction), X and A (offset and field length:base address of the A source vector), Y and B
(offset and field length:base address of the B source vector), Z (control-vector base address), and C
(field length:base address of the result vector), with the register C + 1 holding the offset for C and Z.)
and subfunction codes. The rest of the fields designate the working registers to be
used. The field C + 1 automatically specifies the register holding the offset for
control of the result vectors. The effective starting address is calculated as the
sum of the base address and the offset. The effective field length is calculated
as the offset subtracted from the field length. Thus, the ending address is the
sum of the effective starting address and the effective field length. With offset
capability, the ith element in the source operand can operate with the (i + d)th
element of another source operand, where d is the difference between the two
offsets. The following example shows the streaming operations for vector addition
in the Star-100:
The starting address and effective field length of the A vector are calculated
in Figure 4.7. Note that bit addressing is used and a "1" in the control vector
permits storing the corresponding element in the resulting vector. For example,
the memory location 40005 is stored with a "1," so C5 is transformed into
A5 plus the corresponding element of B. The skewing effect is apparent in this example. After
the instruction has been decoded at the stream unit, the appropriate microcode sequence is
initiated by the microcode unit (MIC) in the stream unit. The sequence involves:
1. The reading of addresses from the register file (in the stream unit) for the vector
parameters according to designations specified in the instruction
2. The calculation of the effective addresses and field lengths for monitoring the
starting of vector operations
3. The setting up of the usage of read-write buses as specified by the G (sub-
function) field for the operands and results
PIPELINE COMPUTERS AND VECTORIZATION METHODS 247
Figure 4.7 Calculation of the effective starting addresses and field lengths for the vector add example.
For the A source vector, the base address 10000 (hex) plus an offset of 4 halfwords gives the starting
address 10080, and the actual field length is the field length minus the offset, 12 - 4 = 8 halfwords.
The B source vector (base address 1FF80) and the C result vector (base address 30000, starting
address 30080) are laid out the same way, each with its own offset; the control vector determines
which result elements are actually stored.
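The set-up arithmetic just described can be condensed into a short C sketch; the structure and
function names are ours, while the bit addressing and 32-bit halfwords follow Figure 4.7.

    #include <stdio.h>
    #include <stdint.h>

    #define HALFWORD_BITS 32u          /* bit addressing, 32-bit halfwords */

    struct vec_desc {
        uint32_t base;                 /* base bit address                 */
        uint32_t offset;               /* offset, in halfwords             */
        uint32_t length;               /* field length, in halfwords       */
    };

    static uint32_t start_address(struct vec_desc v)
    {
        return v.base + v.offset * HALFWORD_BITS;   /* base + offset      */
    }

    static uint32_t effective_length(struct vec_desc v)
    {
        return v.length - v.offset;                 /* field length - offset */
    }

    int main(void)
    {
        /* The A source vector of Figure 4.7: base 10000 (hex), offset 4,
         * field length 12 halfwords. */
        struct vec_desc a = { 0x10000, 4, 12 };
        printf("start = %X, effective length = %u halfwords\n",
               (unsigned)start_address(a), (unsigned)effective_length(a));
        /* prints: start = 10080, effective length = 8 halfwords */
        return 0;
    }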
Once effective addresses are computed, the operand elements are fetched and
paired for the operations involved. The static configuration of the execution pipe
will remain active until the vector instruction is terminated. A termination is marked
by either of the following events: (a) a vector is exhausted when the effective field
length becomes 0; (b) some other data fields or strings have been exhausted.
To support vector and scalar processing in the Star, its operating system provides
time sharing, utilizing the concepts of virtual memory. Prepaging is allowed by a
feature known as advise to alleviate the I/O-bound problem. The Star operating
system handles the functions of input, compilation, assembly, loading, execution,
and output of all programs submitted, as well as the allocation of main memory.
In addition to Star Fortran, an interactive interpreter called Star APL is also
implemented in the system, which upgrades the system's capability to handle a large
area of scientific computations.
Instructions in the TI-ASC have 32 bits, as shown in Figure 4.8, where F is the
opcode; R, T, and M specify the arithmetic, index, and base registers; and N is the
symbolic address. The ASC differs from the Star in the way vector instructions are imple-
mented. Instead of using certain registers to retrieve the operand addresses and
control information, the ASC uses a vector parameter file (VPF), which consists
of eight 32-bit registers in the IPU, as shown in Figure 4.5. The function of each
VPF register is fixed, as shown in Figure 4.8. The register V0 holds the opcode, the
vector-operand type, and the length; V1, V2, and V3 indicate the base address and
the displacement of each operand vector; V4 and V5 hold the increment of the
vector index and the iteration number of the inner loops; and V6 and V7 hold
similar information for the outer loops.
The above control information is loaded into these V registers from the main
memory before the execution of each vector instruction. Microcode will be
(Instruction fields: F, 8 bits; R, 4 bits; T, 4 bits; M, 4 bits; N, 12 bits. Vector parameter file:
registers V0 through V7, holding the vector opcode and lengths, the index designators XA, XB,
and XC, and the operand base/displacement addresses DA.)

Figure 4.8 The TI-ASC instruction format and vector parameter file.
Attached processors are becoming popular because their costs are low and yet they
provide significant improvement over the host machines. The AP-120B and FPS-164
are back-end attached arithmetic processors specially designed to process large
vectors or arrays (matrices) of data. Operationally, these processors must work
with a host computer, which can be either a minicomputer (such as the VAX-11
series) or a mainframe computer (the IBM 308X series). While the host computer
handles the overall system control and supervises I/O and peripheral devices,
the attached processor is responsible for heavy floating-point arithmetic computa-
tions. Such a functional distribution can result in a 200 times speedup over a
minicomputer, and a 20 times speedup over a mainframe computer. Other scientific
attached processors include the IBM 3838 and the low-cost Datawest processor.
We describe in this section the architectural features of these attached processors
and assess their potential applications in the scientific and engineering areas.
(Figure: the AP-120B host interface, showing the host with its display terminal, printer, and tape file,
the front-panel control (16-bit) registers, switches, and lights, the host-memory and AP-memory
address registers, the word-count, control, and format registers, and the 16- or 32-bit host data bus
coupled through DMA to the 38-bit AP-120B data bus and data memory.)
control register governs the direction of data transfer and the mode of transfer.
The format register performs conversion between the FLP format of the host and
that of the AP-120B. Interface logic permits data transfer to occur under the control
of either the host or the AP-120B. The floating-point format in the AP-120B is
38 bits long, with a 28-bit 2's complement mantissa and a 10-bit exponent biased
by 512. Using such a format, the precision and dynamic range are improved over
the conventional 32-bit floating-point format. If the host has different floating-
point data formats, the format conversion is done "on the fly" through the
interface. Consequently, the AP-120B can concentrate on useful computational
tasks.
A detailed functional diagram of the AP-120B processor is shown in Figure
4.10. The processor is divided into six sections: the I/O section, the memory section,
the control memory, the control unit, the data bus, and two arithmetic units. The memory
section consists of the data memory (MD), the table memory (TM), and two data pads
(DPX and DPY). The control memory or program memory (PM) has 64-bit words
with a 50-ns cycle time. The program memory consists of up to 4K words in 256-
word increments. Instructions residing in the PM are fetched, decoded, and executed
in the control unit. The data memory is interleaved with a cycle time of either 167
or 333 ns. The choice of a particular speed depends on the trade-off between cost
Figure 4.10 The block diagram of an AP-120B processor: the program (control) memory of up to
4K 64-bit words, the control unit with the S-pad memory (16 x 16 bits) and the 16-bit address ALU,
the memory-address registers (MA, TMA, DPA), the table memory (RAM or ROM, up to 64K 38-bit
words), the main data memory (up to 1M 38-bit words), the data pads DPX and DPY, the floating-point
adder (A1, A2, FA) and multiplier (M1, M2, FM), and the I/O section (IOP, PIOP) with the host
interface. (Courtesy of Floating-Point Systems, Inc.)
and performance. The data memory is the main data storage unit with 38-bit
words. It is directly addressable, up to 1 million words, in 2K-word (167-ns) or 8K-
word (333-ns) increments. The TM has up to 64K 38-bit ROM or RAM words of
167-ns cycle time. The table memory is used for the storage of frequently used
constants (e.g., FFT constants). It is associated with a special data path which
does not interfere with the data path associated with the data memory. The data
pads X and Y are two blocks of 38-bit accumulators. There are 16 accumulators in
each block. These accumulators are directly addressable by the AP processor. Any
accumulator can be accessed in a single machine cycle of 167 ns. Simultaneous
read and write are possible in each data pad within the same cycle.
The S pad in the control unit contains two parts: an S-pad memory and an
integer ALU. The S-pad memory contains 16 directly addressable integer registers.
These registers feed the address ALU to produce effective operand addresses. The
address ALU performs 16-bit integer arithmetic. The outputs of the address ALU
can be routed to any one of the following address registers: MA for the data
memory, TMA for the table memory, and DPA for the data pads. Other functions
of the address ALU include clear, increment, decrement, logical and, and logical or.
Two pipeline arithmetic units are the FLP adder (FA) and the FLP multiplier
(FM). The FA consists of two input registers, Al and A2, and a two-segment
pipeline, as shown in Figure 4.11. The sum output is a 38-bit normalized floating-
point number. The FM has M1 and M2 input registers and a three-segment
pipeline which performs floating-point multiply operations. Once the pipeline is
(Figure 4.11: the two-stage floating-point adder, with input registers A1 and A2, an intermediate
buffer, a normalize-and-round stage, and the FA output; the possible source connections shown at the
top include M2, A2, M1, DPX, and DPY.)
full, a new result (sum or product) is produced every machine cycle of 167 ns.
Consequently, the maximum throughput rate for the AP-120B is 12 million floating-
point computations per second.
The AP-120B derives its high computing power from multiplicities in all
sections of its processor organization. It uses two pipeline arithmetic units (FA
and FM), one integer ALU, multiple memories (PM, MD, TM) which can be
independently addressed, a large number of registers and accumulators (Al, A2,
M1, M2, MA, TMA, DPA, DPXs, and DPYs), and seven data paths, as shown in
the bus structures section of Figure 4.10.
The two floating-point arithmetic units FA and FM can operate simultaneously
and the 16-bit integer ALU can operate independently of the FA and FM. The use
of two independent blocks of accumulators (DPX and DPY) provides the desired
flexibility in handling operands and intermediate and final results. For instance,
each block can hold a vector operand with 16 components so that a 16-element dot
product can be performed within the FA and the FM in pipeline mode. In other
cases, one block provides data for the FA or FM, while the other block transfers
data to and from data memory or table memory.
The pipeline structures of the FA and FM are described below. The first stage
of the FA compares exponents, shifts the fraction of the smaller number, and adds
the fractions. In the second stage, the resulting fraction is normalized and rounded.
Because of different processing speeds in the two stages, a buffer is inserted to hold
the intermediate result. The output of the FLP adder, denoted by FA, can be
routed to five different destinations. Possible source connections to the input
registers A1 and A2 are shown at the top of Figure 4.11. The FM has three stages.
In the first stage, the 56-bit product of the two 28-bit fractions is partially com-
pleted. The second stage completes the product of the fractions. The third stage
adds the exponents, rounds, and normalizes the fraction of the product. All possible
source and destination connections to the FLP multiplier are identified in Figure
4.12.
Seven buses are used in the AP-120B simultaneously to enable parallel
processing. Both the FA and the FM have multibus input ports. In other words,
multiple operands and results can be moved between different functional units at
the same machine cycle. Thereby, the total data path bandwidth will match the
execution speed of the pipeline adder and multiplier.
Several levels of parallelism in the AP-120B have been described. Another
aspect worthy of mentioning is the control of parallel functional units. This is
provided by the long instruction word of the AP-120B. An AP-120B instruction
has 64 bits, which are subdivided into 10 command fields (Figure 4.13). Each
command field controls a specific unit; therefore, a single AP-120B instruction
can initiate as many as ten operations per machine cycle, as listed in Figure 4.13.
Multiple memory accesses, register transfers, integer arithmetic, and floating-point
computations can occur at the same time.
In summary, multiple memories, multiple functional units, parallel data
paths, and the multiple command fields in the instruction have made the AP-120B
a fast attached processor for scientific computations.
Figure 4.12 The floating-point multiplier in the AP-120B: a three-stage pipeline (partial product of
the fractions, completion of the product, and add exponents/normalize/round) with input registers
M1 and M2, internal buffers, and the FM output routed to its possible destinations (including A1,
M1, DPX, and DPY).
(Figure 4.13: the AP-120B instruction word with its ten command fields, grouped into the S-pad
group (directing the 16-bit control ALU and associated registers), the floating-point adder group, the
multiplier group, the accumulator group, the memory group, and the conditional-branch group.)

The floating-point adder group can specify the operations A1 + A2, A1 - A2, A2 - A1, A1 EQV A2,
A1 AND A2, and A1 OR A2; it can also convert A2 from signed-magnitude to 2's complement format
and back, scale A2, take the absolute value of A2, and fix A2.
Table 4.7 Timing of some AP-120B operations

    Operation                              Mnemonic    Travel time    Pipeline interval

    Matrix operations:
      Matrix transpose                     MTRANS        0.8              17
      Matrix multiply                      MMUL           *               38
      Matrix multiply (dimension <= 32)    MMUL32         *               27
      Matrix inverse                       MATINV         *              130
      Matrix vector multiply (3 x 3)       MVML3         2.5/vector       30
      Matrix vector multiply (4 x 4)       MVML4         4.6/vector       39

    * Timing unknown.
Similarly, the FM can perform many different functions. The timing for some
floating-point arithmetic operations in the AP-120B is summarized in Table 4.7,
where the travel time is the total time required to transfer data from source to
destination, and the pipeline interval is the time between successively available
results. The pipeline interval indicates the maximum throughput rate for vector-
oriented computations.
A detailed example of vector processing in the AP-120B is given below.
First some notation is established. A semicolon ";" separates parallel operations
within an instruction word. A comma "," is used to separate operands. A double
slash "//" denotes a comment. An arrow "←" refers to the replacement
operator for data transfers. Some operations required in presenting the example
are specified below:

    FADD A1,A2      //A1+A2 (floating-point add)
    DPX(m)←FA       //Save FA in location m of data pad DPX.
    FMUL M1,M2      //M1*M2 (floating-point multiply)
    DPY(m)←FM       //Save FM in location m of data pad DPY.

where the inputs A1, A2, M1, and M2 to the adder and multiplier come from the
input sources specified in Figures 4.11 and 4.12, respectively.
Example 4.2 The following sequence is used to compute the dot product of
two vectors, X0Y0 + X1Y1 + X2Y2 + X3Y3 + X4Y4, where the X_i and Y_i are obtained from
DPX and DPY, respectively. The resulting sum of the products is to be stored in DPX:

    (1)  FMUL DPX(0), DPY(0)    //Multiply X0Y0.
    (2)  FMUL DPX(1), DPY(1)    //Multiply X1Y1.
    (3)  FMUL DPX(2), DPY(2)    //Multiply X2Y2.
    (4)  FMUL DPX(3), DPY(3);   //Multiply X3Y3. X0Y0 is now done.
         FADD FM, ZERO            Save it in adder.
    (5)  FMUL DPX(4), DPY(4);   //Multiply X4Y4. X1Y1 is now done.
         FADD FM, ZERO            Save it in adder.
    (6)  FMUL; FADD FM, FA      //X2Y2 is coming out of the multi-
                                  plier and X0Y0 from the adder. Add
                                  them together.
    (7)  FMUL; FADD FM, FA      //X3Y3 is coming out of the multi-
                                  plier and X1Y1 from the adder. Add
                                  them together.
    (8)  FADD FM, FA            //X4Y4 is coming out of the multi-
                                  plier and (X0Y0+X2Y2) from the
                                  adder. Add them together.
    (9)  FADD; DPX(4)←FA        //(X1Y1+X3Y3) is coming out of the
                                  adder. Save it in DPX(4).
    (10) FADD DPX(4), FA        //(X0Y0+X2Y2+X4Y4) is coming out
                                  of the adder. Add it to (X1Y1+X3Y3).
    (11) FADD                   //Push result out of adder pipeline.
    (12) DPX(4)←FA
The dummy add "FADD" without arguments in cycle 11 is used only to push the
last computation out of the pipeline. Remember that there are three stages and
two buffer registers in the FM pipeline; hence two dummy multiplies are needed
to push the last two computations out of the pipeline. For long vectors, the speed
of executing a dot product in the AP-120B is much faster than in a serial processor.
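Written as ordinary sequential code, the sequence of Example 4.2 amounts to the following C
sketch; the splitting into two partial sums mirrors the way the two-stage adder pipeline interleaves
the products, while the function name, array names, and vector length are illustrative only.

    #include <stdio.h>

    /* Dot product of two 5-element vectors with two interleaved partial
     * sums (even- and odd-indexed terms), combined at the end, as in the
     * AP-120B sequence of Example 4.2. */
    double dot5(const double x[5], const double y[5])
    {
        double even = 0.0, odd = 0.0;

        for (int i = 0; i < 5; i++) {
            if (i % 2 == 0)
                even += x[i] * y[i];   /* X0Y0 + X2Y2 + X4Y4 path */
            else
                odd  += x[i] * y[i];   /* X1Y1 + X3Y3 path        */
        }
        return even + odd;             /* final FADD of the two partial sums */
    }

    int main(void)
    {
        double x[5] = {1, 2, 3, 4, 5}, y[5] = {5, 4, 3, 2, 1};
        printf("%g\n", dot5(x, y));    /* prints 35 */
        return 0;
    }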
The AP-120B has been applied extensively in the field of digital-signal pro-
cessing. The execution sequence of fast Fourier transform (FFT) in the AP-120B is
shown below as an example. The FFT program resides in the program memory of
the AP-120B. The array of data to be transformed is stored in the main memory of
the host computer. The FFT computation sequence consists of the following steps:
1. The host computer issues an I/O instruction to initiate the FFT program in
the AP-120B.
2. The AP-120B requests host DMA cycles to transfer the array of data from host
memory to data memory in the AP-120B. The floating-point format is converted
during the flow of data through the interface unit.
3. The FFT computations are performed over a 38-bit floating-point data array.
4. The AP-120B requests host DMA cycles to return the results as the FFT
frequency-domain coefficient array.
Example 4.3 The above operations are called by a host machine with the
following four Fortran statements:

    CALL APCLR           //Clear AP-120B.
    CALL APPUT (...)     //Transfer data to AP-120B.
    CALL CFFT (...)      //Perform FFT.
    CALL APGET (...)     //Transfer results to host.

where "..." denotes the parameters used in the routines.
For the convolution of two arrays, say A and B, all required operations can
also be done by the AP-120B. Once the transfer of data arrays is initiated, there is
no need to wait until completion of the entire array transfer. Such convolution
requires a sequence of forward FFT and inverse FFT operations, as listed below:
attached to either the input-output channel or the DMA channel of a host com-
puter by means of a hardware and software interface similar to that for the AP-
120B. The host machine can be a DEC VAX 11/780, an IBM 4341, or an IBM 3081,
ranging from superminis to large mainframes. The FPS-164 improves its per-
formance over the AP-120B by extended precision (64-bit floating-point numbers
instead of 38 bits, as in the AP-120B) and a much enlarged memory of 16 million
64-bit words. The FPS-164 can be programmed with either a Fortran-77 subset,
FPS-164 symbolic assembly language, or the extensive library of preprogrammed
mathematics, matrix, and applications routines.
A functional block diagram of the FPS-164 is given in Figure 4.14. There are
eight independent pipeline functional units (the FLP multiplier, the FLP adder,
the data pads X and Y, table memory, main memory, integer ALU, and the
data pad bus) interconnected by seven dedicated data paths. The peak speed is
still 12 megaflops. The 64-bit data word provides 15 decimal-digit accuracy. The
64-bit address space covers 16 million words. Multi-user protection is provided by
using memory base and limit registers and privileged instructions. The vectored
priority interrupts allow real-time applications. The dynamic range and accuracy
of the FPS-164 improves significantly over the AP-120B. Furthermore, the
processor has instructions which assist software implementation of double-word
floating-point arithmetic. Diagnostic and reliability features are also built into
the FPS-164 to enhance dependability of the system in case of hardware or software
failures.
The IBM 3838 is a multiple-pipeline scientific processor. It evolved from
the earlier IBM 2938 array processor. Both processors are specially designed to
attach to IBM mainframes, like the System/370, to enhance the vector-process-
ing capability of the host machines. These attached pipeline processors reflect
recent progress in scientific processing at IBM beyond the level of the 360/91 and
the 370/195. Vector instructions that can be executed in the 3838 include the
componentwise vector add, vector multiply, the inner product, the sum of vector
components, convolving multiply, vector move, vector format conversion, fast Fourier
transforms, table interpolations, vector trigonometric and transcendental functions,
polynomial evaluation, and matrix operations. Like the AP-120B and the FPS-164,
both the IBM 2938 and the 3838 are microprogrammed pipeline processors which
can be supplied with custom-ordered instruction sets for specific vector applica-
tions.
The hardware architecture of the IBM 3838 array processor is shown in
Figure 4.15. The processor can attach to a System/370 via a block-multiplexer I/O
channel with a data transfer rate of 1.5 M bytes per second. With an optional two-
byte interface, the maximum data-transfer rate can be doubled to 3 M bytes/s.
The 3838 appears to the host processor I/O channel as a shared control unit.
Up to seven users can be simultaneously active in the 3838. The tasks defined by
each user are pipelined at various subsystems in the 3838. The control processor
can assist the user with a set of scalar instructions and the necessary registers in
preparing vector instructions. The bulk memory is used to hold a large volume of
vector operands. The I/O unit supervises the transfer of data or programs between
Figure 4.14 The FPS-164 system diagram: the host computer and I/O devices on the I/O bus, the
table memory and main memory, the instruction cache and 32-bit subroutine stack, the control
registers, the multiple 64-bit data paths and data pads, the 32-bit address registers, the floating-point
adder and multiplier, and the address/integer arithmetic unit. (Courtesy of Floating-Point Systems, Inc.)
Figure 4.15 The arithmetic processor in the IBM 3838: the control processor and writable control
storage, the data transfer controller with its address registers, the left and right working stores, the
arithmetic element controller with its microprogram control, and the arithmetic pipelines (a four-stage
floating-point multiplier, two four-stage floating-point adders, a three-stage sine/cosine pipeline, and
a five-stage reciprocator). (Courtesy of International Business Machines Corp.)
the host and the bulk memory. The data-word size of the 3838 is 32 bits, matching that
of the System/370 machines.
The transfer of the working sets of the vector segments between the bulk
memory and the working stores is supervised by the data transfer controller (DTC).
Each working store can hold 8192 bytes. Vector-addressing parameters are
supplied to the DTC by the control processor. The DTC is microprogrammed to
generate the effective memory addresses for both the bulk and working memories
before data can be properly transferred. Furthermore, the DTC can perform
data-format conversion during the data flow. The arithmetic controller is also a
microprogrammed unit. The microprogram sequences performed by the arithmetic
pipelines are initialized by this controller. The use of the working stores by the
arithmetic pipelines and by the DTC is synchronized. The basic pipeline cycle time
is 100 ns in the 3838.
There are five pipeline arithmetic units in the 3838. The pipeline units as
diagrammed in Figure 4.15 include two floating-point adders of four stages each,
a four-stage floating-point multiplier; a three-stage sine/cosine pipeline; and a
five-stage reciprocal estimator. Even the working stores appear as a four-stage
pipeline. The delay of each stage is 100 ns. The interconnection paths between
these functional pipes are under the microprogrammed control of the arithmetic
element controller. Access to the writable control storage is also pipelined,
with two stages of delay.
The programs and data to be processed by the 3838 are prepared by the host
computer. Both vector and scalar instructions can be contained in these 3838
programs. The host sends the programs and data to the 3838 through the I/O
channel. Data will be stored in the bulk store. The instructions will be executed
by the control processor. After the decoding of each instruction, the control
processor provides linked lists of microprogram sequences for supervising the
pipelined execution of the instructions. While the arithmetic pipelines are updating
vector data from one working store, the DTC can load the other working store.
Therefore, data loading and instruction execution can be done simultaneously at
the two banks of the working stores. This facilitates the multiprogrammed use of
the 3838. Concurrent pipelining allows multiple users to share the hardware
resources in achieving high system throughput. The maximum speed of the 3838
has been estimated to be 30 megaflops.
Datawest, Inc. at Scottsdale, Arizona, has built a very sophisticated attached
processor called MATP for large scientific computations. The MATP consists of
up to four pipeline processors. These processors, forming a hybrid MIMD-SIMD
system, are microprogrammable and share a common data memory. Each pro-
cessor can be controlled by separate writable control stores. The primary means
of host communication is through a set of program channels that connect to host
I/O channels.
A schematic functional block diagram of the Datawest MATP is shown in
Figure 4.16. This processor is designed to work with a Univac 1184 computer.
Using a Univac and an MATP at a cost of $4 million, Datawest claims that it can
attain a peak rate of 120 megaflops. This compares favorably with the 160-mega-
flop Cray-1 with a $10 million cost. The Fujitsu FACOM 230/75 is another
attached array processor with a peak performance of 22 megaflops when attached
to a FACOM 200M mainframe.
A comparison of three competing attached processors manufactured in the
United States is given in Table 4.8. All three processors, FPS's AP-120B, IBM's 3838,
and Datawest's MATP, are pipelined and microprogrammable. The speeds shown
are theoretical peak speeds in megaflops. The speed of the MATP corresponds to
a maximum configuration of four processors. It is interesting to note the multi-
processor structure in the MATP. This concept of pipelining in a multiprocessing
mode is also seen in other supercomputers like the Cray X-MP and the HEP, to be
introduced in Chapter 9.
Attached array processors are effective in seismic-signal processing. If one
enlarges the instruction repertoire of array processors, they can be turned into
general-purpose scientific processors. The attempt by Datawest is a good example.
Most scientific computers remain outside the mainstream of developing large
computers for business use. The peak speed shows only a theoretical limit. It is
the degree to which parallelism is exploited in the application programs that
determines the effectiveness of a scientific processor. In general, attached processors
have specialized architectures that appeal better to programs containing many
(Figure 4.16: schematic functional block diagram of the Datawest MATP, showing the host CPU
and data channel, the shared data memory, and the address, data, and control paths to the
microprogrammable pipeline processors.)
The three most recently developed vector processors are described in this section,
namely Cray Research’s Cray-1, Control Data’s Cyber-205, and Fujitsu’s VP-200.
All three are commercial supercomputers with multiple pipelines for concurrent
scalar and vector processing. Possible extensions to these vector supercomputers
will be elaborated at the end. We focus on the architectural structures, special
hardware functions, software supports, and parallel processing techniques that
have been developed with these second-generation vector processors.
front end, which is connected to the Cray-1 CPU via I/O channels. Figure 4.17
shows the front-end system interface and the Cray-1 memory and functional
sections. The CPU contains a computation section, a memory section, and an I/O
section. Twenty-four I/O channels are connected to the front-end computer, the
I/O stations, peripheral equipment, the mass-storage subsystem, and a maintenance
control unit (MCU). The front-end system will collect data, present it to the Cray-1
for processing, and receive output from the Cray-1 for distribution to slower
devices. Table 4.9 summarizes the key characteristics of the three sections in the
CPU of the Cray-1.
The memory section in the Cray-1 computer is organized in 8 or 16 banks with
72 modules per bank. Bipolar RAMs are used in the main memory with, at most,
one million words of 72 bits each. Each memory module contributes 1 bit of a
72-bit word, out of which 8 bits are parity checks for single error correction
and double error detection (SECDED). The actual data word has only 64 bits.
Sixteen-way interleaving is constructed for fast memory access with few bank
conflicts. The bipolar memory has a cycle time of 50 ns (four clock periods). The
transfer of information from this large bipolar memory to the computation section
can be done in one, two, or four words per clock period. With a memory cycle of
50 ns, the memory bandwidth is 320 million words per second, or four words per
each of the 80 million clock periods per second.
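A minimal sketch of SECDED encoding for a 64-bit data word is given below; it uses a textbook
Hamming layout with seven check bits plus an overall parity bit, which accounts for the 8 check
bits mentioned above but is not necessarily the Cray-1's actual check-bit assignment.

    #include <stdint.h>
    #include <stdio.h>

    /* 64 data bits + 7 Hamming check bits + 1 overall parity bit = 72 bits. */
    typedef struct { uint8_t bit[73]; } codeword72;   /* bit[1..72] used */

    codeword72 secded_encode(uint64_t data)
    {
        codeword72 w = {{0}};
        int pos, d = 0;

        /* Scatter the 64 data bits into positions 1..71 that are not powers of two. */
        for (pos = 1; pos <= 71; pos++)
            if ((pos & (pos - 1)) != 0)
                w.bit[pos] = (uint8_t)((data >> d++) & 1u);

        /* Each Hamming check bit at position 2^i covers the positions with that bit set. */
        for (int i = 0; i < 7; i++) {
            int p = 1 << i, parity = 0;
            for (pos = 1; pos <= 71; pos++)
                if ((pos & p) && pos != p)
                    parity ^= w.bit[pos];
            w.bit[p] = (uint8_t)parity;
        }

        /* The overall parity bit is what allows double errors to be detected
         * rather than miscorrected. */
        int overall = 0;
        for (pos = 1; pos <= 71; pos++) overall ^= w.bit[pos];
        w.bit[72] = (uint8_t)overall;
        return w;
    }

    int main(void)
    {
        codeword72 w = secded_encode(0x0123456789ABCDEFull);
        printf("check bits:");
        for (int i = 0; i < 7; i++) printf(" %d", w.bit[1 << i]);
        printf("  overall %d\n", w.bit[72]);
        return 0;
    }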
(Figure 4.17: the Cray-1 memory and functional sections. The computation section contains the
registers, functional units, and instruction buffers; the memory section holds 0.25 M, 0.5 M, or 1 M
64-bit bipolar words; the I/O section provides 12 input and 12 output channels connecting the
front-end computers, I/O stations, peripheral equipment, the mass-storage subsystem, and the MCU.)
Such high-speed data-transfer rates are necessary to match the high processing
bandwidth of the functional pipelines.
The I/O section contains 12 input and 12 output channels. Each channel has
a maximum transfer rate of 80 M bytes/s. The channels are grouped into six input
or six output channel groups and are served equally by all memory banks. At
most, one 64-bit word can be transferred per channel during each clock period.
Four input channels or four output channels operate simultaneously to achieve
the maximum transfer of instructions to the computation section. The MCU in
Figure 4.17 handles system initiation and monitors system performance. The mass
storage subsystem provides large secondary storage in addition to the one million
bipolar main memory words.
A functional block diagram of the computation section is shown in Figure
4.18. It contains 64 x 4 instruction buffers and over 800 registers for various
purposes. The 12 functional units are all pipelines with one to seven clock periods
of delay, except for the reciprocal unit, which has a delay of 14 clock periods. Arithmetic
operations include 24-bit integer and 64-bit floating-point computations. Large
Figure 4.18 Arithmetic logic pipelines, registers, buffers, memory, and data paths in the Cray-1: the
eight V (vector) registers with the vector control, VM, VL, and RTC registers, the eight S (scalar) and
64 T registers, the eight A (address) and 64 B registers, the four instruction buffers with the P, NIP,
CIP, and LIP registers, the exchange control, and the vector, floating-point, scalar, and address
functional units. (Courtesy of Cray Research, Inc.)
Figure 4.19 Instruction format of the Cray-1: (a) the instruction fields g, h, i, j, k, and m, spanning
a first and an optional second 16-bit parcel; (b) the three-address form, with the operation code in
the g and h fields and the result and operand registers designated by i, j, and k; (c) the unary vector
instruction form, with an operation code, an operand register, and a field supplying the shift or mask
count and the result register.
fetched per clock period to the least recently used instruction buffer. To allow fast
issuing of instructions, the memory word containing the current instruction is the
first to be fetched. The Cray-1 has 128 instructions with 10 vector types and 13
scalar types, the majority of which are three-address instructions (Figure 4.19b).
Figure 4.19c shows the format of a unary vector instruction.
The P register is a 22-bit program counter indicating the next parcel of program
code to enter the next instruction parcel (NIP) register in a linear program sequence.
The P register is entered with a new value on a branch instruction or on an exchange
sequence. The current instruction parcel (CIP) register is a 16-bit register holding
the instruction waiting to be issued. The NIP register is a 16-bit register which
holds a parcel of program code prior to entering the CIP register. If an instruction
has 32 bits, the CIP register holds the upper half of the instruction. The lower
instruction parcel (LIP) register, which is also 16 bits long, holds the lower half.
Other registers, such as the vector mask (VM) register, the base address (BA) and
limit address (LA) registers, the exchange address (XA) register, the flag (F) register,
and the mode (M) register, are used for masking, addressing, and program control
purposes.
The twelve functional units in the Cray-1 are organized into four groups:
address, scalar, vector, and floating-point pipelines, as summarized in Table 4.10.
Each functional pipe has several stages. The register usage and the number of
pipeline stages for each functional unit are specified in the table. The number of
required clock periods equals the number of stages in each functional pipe. Each
functional pipe can operate independently of the operation of others. A number of
functional pipes can operate concurrently as long as there are no register conflicts.
A functional pipe receives operands from the source registers and delivers the
result to a destination register. These pipelines operate essentially in three-address
mode with limited source and destination addressing.
The address pipes perform 24-bit 2’s complement integer arithmetic on
operands obtained from A registers and deliver the results back to A registers.
There are two address pipes: the address add pipe and the address multiply pipe.
The scalar pipes are for scalar add, scalar shift, scalar logical, and population-leading
zero count, performing operations over 64-bit operands from S registers, and in
most cases delivering the 64-bit result to an S register. The exception is the
population-leading zero count, which delivers a 7-bit integer result to an A register.
The scalar shift pipe can shift either the 64-bit contents of an S register or the
128-bit contents of two S registers concatenated together to form a double precision
word. The population count pipe counts the number of “1” bits in the operand,
while the leading zero count counts the number of "0" bits preceding a 1 bit in the
operand. The scalar logical pipe performs the mask and boolean operations.
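The two counts are easy to state in software; the following C sketch mirrors the definitions in the
text for a 64-bit operand (the loop form is illustrative, not the hardware implementation).

    #include <stdint.h>
    #include <stdio.h>

    int population_count(uint64_t s)
    {
        int ones = 0;
        while (s) { ones += (int)(s & 1u); s >>= 1; }   /* count the "1" bits */
        return ones;                                    /* 0..64, fits in 7 bits */
    }

    int leading_zero_count(uint64_t s)
    {
        int zeros = 0;
        if (s == 0) return 64;
        while (!(s & 0x8000000000000000ull)) {          /* zeros before the first 1 bit */
            zeros++;
            s <<= 1;
        }
        return zeros;
    }

    int main(void)
    {
        printf("%d %d\n", population_count(0x00F0ull), leading_zero_count(0x00F0ull));
        /* prints: 4 56 */
        return 0;
    }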
The vector pipes include the vector add, vector logical, and vector shift units. These
units obtain operands from one or two V registers and an S register. Results from
a vector pipe are delivered to a V register. When a floating-point pipe is used for a
vector operation, it can function similar to a vector pipe. The three floating-point
pipes are for FLP add, FLP multiply, and reciprocal approximation over floating-
point operands. The reciprocal approximation pipe finds the approximated
path between memory and working registers can be considered a data transmit
pipeline with a fixed-time delay.
When a vector instruction is issued, the required functional pipes and operand
registers are reserved for a number of clock periods determined by the vector
length. Subsequent vector instructions using the same set of functional units or
operand registers cannot be issued until the reservations are released. Two or
more vector instructions may use different functional pipelines and different vector
registers at the same time, if they are independent. Such concurrent instructions
can be issued in consecutive clock periods. Figure 4.21a shows two independent
instructions, one using the add pipe and the other using the multiply pipe. Figure
4.21b depicts the demand on the add pipe by two independent vector additions.
When the first add instruction is issued, the add pipe is reserved. Therefore, the
issue of the second add instruction is delayed until the add pipe is freed. Figure
4.21c shows two different vector instructions sharing the same operand register
V1. The first add instruction reserves the operand register V1, causing the issue of
the multiply instruction to be delayed until the operand register V1 is freed. Figure
4.21d illustrates the reservation of both the add pipe and the operand register V1.
Like the reservation required for operand registers, the result register also needs
to be reserved for the number of clock periods determined by the vector length
and the pipeline delays. This reservation ensures the proper transmittal of the
final result to the result register.
A result register may become the operand register of a succeeding instruction.
In the Cray-1, the technique is called chaining of two pipelines. Pipeline chaining
Figure 4.21 The reservation of functional units and operand registers: (a) two independent
instructions using different functional pipes; (b) two vector additions contending for the add pipe;
(c) two instructions sharing the operand register V1; (d) an instruction pair that reserves both the
add pipe and the operand register V1.
Example 4.4 The following sequence of four vector instructions is chained
together to be executed as a compound function:
In a vector operation, the results are normally not restored to the same vector
register used by the source operands. Under certain circumstances, it may be
desirable to route results directly back to one of the operand registers. Such
recursive operations on functional pipelines require special precautions to avoid
the data-jamming problem. To see how recursive computation can be realized in
a pipeline, component operations must be properly monitored. Associated with
each vector register is a component counter. When a vector instruction is issued,
all component counters are set to zero. Normally, sending an operand from a
source register to a functional pipeline causes the associated component counter
Figure 4.22 A pipeline chaining example in the Cray-1: the memory fetch pipe, the vector add pipe,
the right shift pipe, and the logical product pipe linked so that the result stream of each feeds the
next. (Courtesy of Cray Research, Inc., Johnson 1977.)
Figure 4.23 Timing diagram for the chaining example of Figure 4.22. The labeled steps are: a, transfer
of a memory word to the load functional unit; b, transit of the memory word through the load
functional unit; c, transfer of the memory word from the load functional unit to an element of V0;
d, transfer of operand elements in V0 and V1 to the integer add functional unit; e, computation of the
sum by the integer add functional unit; f, transit of the sum from the integer add functional unit to an
element of V2; g, transfer of an operand element in V2 to the shift functional unit; h, the shift
operation performed by the shift functional unit; i, transit of the shifted sum from the shift functional
unit to an element of V3; j, transfer of operand elements in V3 and V4 to the logical functional unit;
k, the logical operation performed by the logical functional unit; l, transfer of the final result to an
element of V5.
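Read per element, the compound function traced in Figures 4.22 and 4.23 amounts to the following
C sketch; the function and argument names, the shift count, and the use of a plain loop in place of
the chained vector pipes are illustrative only.

    #include <stdint.h>
    #include <stdio.h>

    /* Memory fetch -> V0, integer add V0+V1 -> V2, right shift -> V3,
     * logical product with V4 -> V5, one element per iteration. */
    void chained_compound(const uint64_t mem[], const uint64_t v1[],
                          const uint64_t v4[], uint64_t v5[],
                          int vl, unsigned shift)
    {
        for (int i = 0; i < vl; i++) {
            uint64_t v0 = mem[i];             /* memory fetch pipe   -> V0 */
            uint64_t v2 = v0 + v1[i];         /* vector add pipe     -> V2 */
            uint64_t v3 = v2 >> shift;        /* right shift pipe    -> V3 */
            v5[i]       = v3 & v4[i];         /* logical product pipe-> V5 */
        }
    }

    int main(void)
    {
        uint64_t mem[4] = {5, 6, 7, 8}, v1[4] = {1, 1, 1, 1};
        uint64_t v4[4]  = {3, 3, 3, 3}, v5[4];
        chained_compound(mem, v1, v4, v5, 4, 1);
        printf("%llu %llu %llu %llu\n",
               (unsigned long long)v5[0], (unsigned long long)v5[1],
               (unsigned long long)v5[2], (unsigned long long)v5[3]);
        /* prints: 3 3 0 0 */
        return 0;
    }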
PIPELINE COMPUTERS AND VECTORIZATION METHODS 277
Example 4.5 Let A and B be vectors of length N. Consider the following loop
operations:

    DO 10 I=1,N
 10 A(I)=5.0*B(I)+C
When N is 64 or less, a sequence of seven instructions generates the A array:
5, - 50 Set constant in scalar register
S.-C Load constant C in scalar register
VLeN Set vector length into VL register
y<- B Read B vector into vector register
V,— 8S, 8h Multiply each component of the B array by a constant
V,<- 8, + ¥, Add C to 5 « BU)
AcD, Store the result vector in A array
The fifth to sixth instructions use different functional pipelines with shared
intermediate registers. They can be chained together. The outputs of
the chain are finally stored in the 4 array. When N exceeds 64, vector loops
are required. Before entering the loop, N is divided by 64 to determine the
loop count. If there is a remainder, the remainder elements of the A array are
generated in the first loop. The loop consists of the fourth to the seventh
instructions for each 64-element segment of the 4 and B arrays.
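The segmentation scheme is easy to picture in software. The following is a minimal Python sketch (illustrative only, not Cray code) of the strip-mining just described: the vector length N is split into one remainder segment followed by full 64-element segments, and each segment stands in for one trip of the vector loop.

```python
# Illustrative sketch of Cray-1 style strip-mining (Example 4.5), assuming a
# maximum hardware vector length of 64.  The loop body stands in for the
# chained multiply and add pipes.

def saxpy_strip_mined(b, c, max_vl=64):
    """Compute a[i] = 5.0 * b[i] + c in segments of at most max_vl elements."""
    n = len(b)
    a = [0.0] * n
    remainder = n % max_vl
    segments = ([remainder] if remainder else []) + [max_vl] * (n // max_vl)

    start = 0
    for vl in segments:                 # one "vector loop" trip per segment
        for i in range(start, start + vl):   # VL register would be set to vl here
            a[i] = 5.0 * b[i] + c
        start += vl
    return a

if __name__ == "__main__":
    b = list(range(1, 131))             # N = 130 -> segments of 2, 64, 64
    print(saxpy_strip_mined(b, 7.0)[:5])
```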
Figure 4.24 A timing chart showing the recursive summation of the vector components in Example 4.6.
Example 4.6 Consider the use of the floating-point add pipe for the recursive
vector summation V0 <- V0 + V1, where the vector register V1 holds an
array of floating-point numbers to be added recursively. The timing chart of
the recursion is shown in Figure 4.24. Initially, both counters C0 and C1 are
set to zero. The initial value in the first component register V0(0) of V0
is also set to zero. The FLP add pipe requires six clock periods to pass through.
Register transfer to or from the FLP add pipe takes another clock period.
Therefore, the total cycle is 1 + 6 + 1 = 8 clock periods, as shown in Figure
4.24. The vector-length register is assumed to have a value of 64 for a single
vector loop.
The counter C0 is kept at zero until time t8. During this cycle, V0(0)
(which is 0) is sent through the pipeline. However, the counter C1 keeps
incrementing after each clock period. Therefore, V1(0), V1(1), ..., V1(63) are sent
to the pipeline in the 64 subsequent clock periods. After t8, C0 gets incre-
mented by one after each clock period. This means the successive output sums
are added recursively with one additional component from V1 in every eight
clock periods. When the computations are completed, the component registers
of V0 are loaded as shown in Table 4.11. The 64 components are
divided into eight groups of eight component sums each. The last summation
group, from V0(56) to V0(63), holds the eight partial sums, each covering eight
components of V1.
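The recursion is easy to check numerically. Below is a small Python sketch (assumed semantics, following the description in Example 4.6) of an adder whose result re-enters the operand stream eight clock periods later; the last eight components of V0 then hold the eight partial sums, and their total equals the sum of V1.

```python
# Minimal model of the recursive summation V0 <- V0 + V1 with an 8-clock
# feedback path (an illustration of Example 4.6, not Cray microcode).

def recursive_sum(v1, feedback=8):
    v0 = [0.0] * len(v1)
    for t, x in enumerate(v1):
        # before the feedback loop closes, the initial V0(0) = 0 is reused
        prev = v0[t - feedback] if t >= feedback else 0.0
        v0[t] = prev + x
    return v0

if __name__ == "__main__":
    v1 = [float(i) for i in range(1, 65)]       # 64 components
    v0 = recursive_sum(v1)
    partials = v0[-8:]                          # V0(56) .. V0(63)
    print(partials, sum(partials) == sum(v1))   # eight group sums; total checks out
```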
The performance of the Cray-1 may vary from 3 to 160 megaflops, depending
on the application and programming skill. Scalar performance of 12 megaflops
was observed for matrix multiplication. Vector performance of 22 megaflops was
observed in vector dot product operations. Supervector performance of 153
megaflops was observed in assembly-code matrix multiplication. These speeds are
special peak values. The Cray-1 is more likely to have an average vector-super-
vector performance in the range of 20 to 80 megaflops, depending of course on the
workload distribution.
In order to achieve even better supercomputer performance, Cray Research
has extended the Cray-1 to the Cray X-MP, a dual-processor system with loosely
coupled multiprogramming and single-program multiprocessing. The Cray
X-MP has eight times the Cray-1 memory bandwidth and a reduced clock period
of 9.5 ns. It has guaranteed chaining. Furthermore, the software for the Cray
X-MP is compatible with that of the Cray-1. The first customer shipments of the
Cray X-MP took place in 1983, with full production in 1984.
The Cray X-MP offers an impressive speedup over the Cray-1. For mixed jobs,
it has been estimated that the Cray X-MP has a 2.5 to 5 times throughput gain
over the Cray-1. For scalar processing, it is 1.25 to 2.5 times faster than the Cray-1.
Like the Cray-1, it is excellent for both short and long vector processing.
When the Cray X-MP is upgraded to the Cray-2 after the mid-80s, the per-
formance is expected to increase six times in scalar and 12 times in vector operations
over the Cray-1. The Cray-2 will have four processors with a basic pipeline clock
period of 4 ns, 32 M words of main memory, and 20 times improved I/O. We will
study the various Cray Research multiprocessors in detail in Chapter 9.
Figure 4.25 The Cyber-205 computer system configuration, showing the scalar unit, the stream unit, the vector arithmetic pipes, the string unit, the maintenance control unit, the I/O ports, and the one-million-word memory modules. (Courtesy of Control Data Corp.)
The memory bandwidth of the Cyber-205 is much higher than that in the
Cray-1. The high memory bandwidth is needed to support the memory-to-memory
pipeline operations.
Instruction-execution control resides in the scalar unit, which receives and
decodes all instructions from memory, directly executes scalar instructions, and
dispatches vector and string instructions to the four vector pipes and the string unit
for execution. It also provides orderly buffering and execution of the data and
instructions. With independent vector and scalar instruction controls on a single
instruction stream, the scalar unit can execute scalar instructions in parallel with
the execution of most vector instructions.
The scalar arithmetic unit contains five independent functional pipes for
add/subtract, multiply, log, shift, and divide/sqrt operations over 32- or 64-bit
scalars. The peak speed of the scalar processor is 50 megaflops. The vector pro-
cessor has the option of having one, two, or four floating-point arithmetic pipes.
The stream unit manages the data streams between central memory and the vector
pipelines. A vector arithmetic pipe can perform add/subtract, multiply, divide,
sqrt, logical, and shift operations over 32- or 64-bit vector operands. Each vector
pipeline is directly connected to the main memory without using vector registers
(Figure 4.25). The string unit processes the control vectors during streaming
operations. It provides the capability for BCD and binary arithmetic, address
arithmetic, and Boolean operations.
The vector startup time in the Cyber-205 is much longer than that of the
Cray-1. A vector may comprise up to 65,535 consecutive memory words. Control
vectors are used to address data that is stored in nonconsecutive locations. Each
pipeline receives two input streams and generates one output stream of floating-
point numbers. Each stream is 128 bits wide, supporting a 100-megaflops computa-
tion rate for 32-bit results or 50 megaflops for 64-bit results per vector pipe.
With four vector pipes, the Cyber-205 can produce 200 megaflops for 64-bit
results and 400 megaflops for 32-bit results in vector add/subtract or in vector
multiply operations. The Cyber-205 can also perform linked vector
multiply and vector add/subtract operations with a maximum rate of 800 or 400
megaflops for 32- or 64-bit results, respectively, on a four-pipe configuration.
The vector divide and square root operations are much slower than the add/subtract
or multiply operations.
Each vector arithmetic unit consists of five functional pipes, as shown in
Figure 4.26a. The detailed pipeline stages in the floating-point add unit and the
multiply unit are shown in Figure 4.26b and c. Both pipeline units have feedback
connections for accumulative add or multiply operations. These two units are
improved over the designs in the Star-100 (Figure 4.3). The pipeline delays in both
the vector and scalar units are summarized in Table 4.12. With a 20-ns clock, the
number of required clock periods is also shown in each case. The load/store is also
pipelined, producing one result per clock period of 20 ns. The result
from any of the above units can be routed directly to the input of other units
without stopping in an intermediate register. This process is called short-
stopping, as facilitated by the feedback connections in Figure 4.26. The theoretical
peak performance of the Cyber-205 is summarized in Table 4.13 for 32- and 64-bit
results.
The bipolar memory is four-way interleaved, giving an effective cycle time of
20 ns per word. The central memory is a virtual memory system with advanced
memory management features such as key and lock for memory protection and
separation, hardware mapping from virtual to physical addresses, and user program-
data sharing capability. The page sizes vary from small (1K, 2K, 8K words) to
large (65K words). The Cyber-205 has 16 I/O channels, each 16 bits wide. The
I/O system consists of multiple minicomputers for handling up to 10 disk stations.
Table 4.12 Pipeline delays in the Cyber-205 (20-ns clock)

Operation               Delay (ns)          Clock periods
Load-store              300                 15
Add-subtract            100                  5
Multiply                100                  5
Logical                  60                  3
Divide-square root     1080 (64 bits)       54
Conversion              600 (32 bits)       30
Figure 4.26 The vector arithmetic pipes of the Cyber-205: (a) the five functional units (addition, data interchange, shift, logical, and delay units); (b) the floating-point addition pipeline; (c) the multiply pipeline with partial-sum/partial-carry, merge, and significance-shift stages. Shortstop feedback paths route a 128-bit result directly back to the operand inputs.
Each station can accommodate eight disk drives. Fifty-megabaud serial line
interfaces and network access devices can be used to connect the Cyber-205 with
the CDC Cybernet network.
Software support for the Cyber-205 consists primarily of the Cyber-200 OS,
the Cyber-200 Fortran compiler, the Cyber-200 Assembler META, and Cyber-200
utility programs. The Cyber-205 Fortran compiler provides code optimization, loop
collapsing into vector instructions whenever possible, effective utilization of the
large register file, and accessibility to 256 Cyber-205 instructions, divided into 16
vector types and 10 scalar types.
The most obvious architectural improvement of the second-generation vector
processors over the first generation is the inclusion of a scalar processor for non-
vector operations. By far, the Cray-1 and Cyber-205 are the fastest processors
manufactured in the United States, with cycle times of 12.5 and 20 ns, respec-
tively. Both systems use bipolar ECL circuits. Only four chip types are used in the
Cray-1, versus 26 chip types in the Cyber-205. Multiple unifunction pipelines are
used in the Cray-1, while the Cyber-205 is equipped with multifunction static
pipelines. In both systems, fast bipolar main memory is used.
To compare the vector-processing capabilities of the Cray-1 and Cyber-205,
we consider the parallel execution of the following program.
Example 4.7

DO 10 I=1,1024
10 Y(I)=A(I)*B(I)

On the Cray-1, the above DO loop would be computed at a CPU rate of

1 result / 12.5 ns = 80 megaflops
If two vector pipelines are used in the Cyber-205, then 100 megaflops could
be achieved. The entire operation must be grouped into 16 successive segments
of 64 operations each, since 64 is the loop vector length in both systems.
The Cyber-205 has richer vector instructions than the Cray-1, whereas the
latter has better scalar instructions. With two pipeline processors, the Cyber has
a 10-ns effective burst rate. The Cray-1 can be attached to IBM, CDC, or Univac
front-end host computers. The Cyber-205 can be driven by the Cyber-170 series or
the IBM 303X. A major difference between the two supercomputers is the all-LSI
chip technology in the Cyber-205, as opposed to the SSI logic parts in the Cray-1,
except for the LSI 4K RAM chips in the Cray-1. The two systems differ also in the
arithmetic pipes, masked vector operations, and the I/O linkage to
the front-end host.
[Figure: arrays A and B stored in Fortran order; A is viewed as a single vector of length 16, while B can be viewed either as four vectors of length 16 or as a single vector of length 64, with the elements interleaved across memory banks 0 through 5.]
1. Assist in memory management: The operating system can commit and de-
commit arbitrary blocks of real memory without having to ensure the physical
contiguity of a user's workspace. This reduces overhead for actions such as the
accumulation of unused space, which could be quite costly in large memory
systems (on the order of eight million words).
2. Provide identically appearing execution of all jobs: This means that a program's
behavior does not depend on where its pages happen to reside in physical memory.
Figure 4.28 Memory-to-memory vector pipelining in the Cyber-205. (Courtesy of IEEE Trans. Computers, Lincoln 1982.)
A most significant contribution of the Cyber-205 was the notion that strings of
binary bits (called bit vectors) could be used to carry information about vectors
and could be applied to those vectors to perform some key functions. Since the bit
strings became the key to the vector-restructuring concept, a means had to be
provided to manipulate bit vectors as well as numeric vectors; thus was added the
string functional unit to the hardware ensemble in both the Star-100 and the
Cyber-205. The functions of compress, mask, merge, scatter, and gather were
incorporated. In addition, many of the reduction operations like sum, product, and
inner product were implemented directly in the hardware.
Two special vector instructions using the control bit vectors are illustrated in
Figure 4.29. In part a, the two source vectors A and B are merged under the control
of the bit vector C to give the result vector R. The merging is conducted so as to
select from A on “1” in C, and from B on “0” in C. In part b the source vector A
Figure 4.29 Two vector instructions in the Cyber-205 using control vectors: (a) the MERGE instruction; (b) the COMPRESS instruction.
is being compressed to give the result vector R under the bit vector B. The compres-
sion is done on “0” in B. These instructions are extremely useful in manipulating
sparse matrices.
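The behavior of these two instructions can be mimicked in a few lines of ordinary code. The sketch below is plain Python (not Cyber-205 code) and follows the conventions stated above: MERGE selects from A on "1" bits of the control vector and from B on "0" bits, and COMPRESS keeps the elements of A selected by the bit vector, following the text's "0"-selection convention.

```python
# Illustrative MERGE and COMPRESS control-vector operations (Python sketch).

def merge(a, b, ctrl):
    """R(i) = A(i) where ctrl(i) == 1, else B(i)."""
    return [x if c == 1 else y for x, y, c in zip(a, b, ctrl)]

def compress(a, ctrl, select=0):
    """Keep A(i) wherever ctrl(i) == select, packed into contiguous positions."""
    return [x for x, c in zip(a, ctrl) if c == select]

if __name__ == "__main__":
    a    = [10, 11, 12, 13, 14, 15]
    b    = [20, 21, 22, 23, 24, 25]
    bits = [1, 0, 0, 1, 1, 0]
    print(merge(a, b, bits))     # [10, 21, 22, 13, 14, 25]
    print(compress(a, bits))     # elements of A on "0" bits: [11, 12, 15]
```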
Startup time of vector operations includes the instruction translation time and
the delay imposed by fetching and aligning the input streams and by aligning and
moving output streams to memory. Figure 4.30 illustrates the impact of startup
time on the effective performance of the Cyber-205 architecture. A major improve-
ment in the overall performance of this type of memory-to-memory vector archi-
tecture requires careful attention to startup time. In addition to the raw improve-
ment in clock speed, the designers were directed to other methods for reducing
the delay in vector initiation.
The actions of address setup and management of vector arithmetic control
lines require substantial speedup. In addition, providing separate and independent
functional units in each of the scalar and vector processors permits execution of
those functions in parallel. Hence, the apparent startup time becomes smaller than
what the hardware provides. This feature, in the best cases, results in parallel
execution of both scalar and vector floating-point operations with a consequent
increase in overall performance.
Identified below are three major changes in the architecture of the Star-100 to
yield the Cyber-205:
1. In the Star-100, the operation of the scalar unit was coupled with the vector
unit such that only one type of operation could be performed at a time. The
Cyber-205 has the ability to run both vector unit and scalar unit in parallel.
2. To fit a variety of operating environments, the Cyber-200 family was provided
with a range of small page sizes beginning with 4096 bytes and ending with
65,536 bytes. The large page size of the Star-100 was retained since it appeared
to be optimum for large production programs.
3. The input-output system employed in the Star-100 was of the “star network”
type, with node-to-node communication between the CPU and attached
peripherals. The change to a network form of I/O, which is called the loosely
coupled network (LCN), was a major switch for hardware and software alike
on the Cyber-205.
Figure 4.31 illustrates the star network connection of the Star-100 and attached
peripherals and contrasts this with the Cyber-205. Note that the connectivity of
the Cyber-205 is potentially much greater than that of its predecessor. In addition,
data transfers between elements of the Cyber-205 system can bypass the front-end
elements; in the Star-100, data rates between permanent file storage and the CPU
are limited by the capacity of the front-end processor, which typically is on the
order of 1 to 4 Mbits/s.
Transmission of data is accomplished on a high-speed, bit-serial trunk to
which 2 to 16 system elements can be coupled. The method of establishing a link
is based on addresses in the serial message which can be recognized by the hard-
ware trunk coupler, called the network access device (NAD). The most significant
Figure 4.30 Effect of startup delay on the Cyber-205 performance: megaflops versus vector length (up to 65K elements) for 32- and 64-bit results with two or four pipes; part (b) shows multiply-add performance. (Courtesy of IEEE Trans. Computers, Lincoln 1982.)
Figure 4.31 The star-network I/O connection of the Star-100, with mass storage, magnetic tape, service, and record access stations attached around the central processing unit through station control units, contrasted with the Cyber-205 loosely coupled network, in which the Cyber 200 Model 205, CDC Cyber 18 and Cyber 170 front ends, and the stations attach to the trunk through network access devices (NADs).
aspect of this decision has been the philosophical departure from dedicated peri-
pherals to shared peripherals, which are accessed on a “party line” basis. Once
this change has been incorporated in the system software, the actual transmission
media and hardware form of NAD is invisible to the user.
In 1979, Control Data Corporation proposed to the NASA Ames Research
Center a supercomputer design, called the Numerical Aerodynamic Simulation
Facility (NASF), to be used in the 1990s for aerospace vehicle or superjet designs.
The purpose is to provide predictive three-dimensional modeling of wind
tunnel experiments characterized by the viscous Navier-Stokes fluid equations. This
computational approach to solving fluid dynamic problems is constrained only by
processor speed and memory space. It costs much less than building a huge wind
tunnel, which is limited by many physical factors. The speed requirement of the
NASF was set to be at least 1000 megaflops. Feasibility has been established, and
U.S. government funding is being awaited before proceeding with the design and
construction of such a supercomputer.
The CDC/NASF design extends the structure of the Cyber-205, as shown in
Figure 4.32a. There are five vector pipelines in the NASF, with one serving as a spare
unit. Functional components in one vector pipeline are shown in Figure 4.32b. A
separate scalar processor is used. The clock period of this proposed design is 8 ns.
The memory hierarchy has three levels: an 8 M-word ECL cache, a 32 M-word
MOS intermediate memory, and a CCD sequential memory of 128 M words.
Within each vector pipeline, the adders, multipliers, and complementers are all
duplicated in pairs to facilitate parallel real or complex number calculations and
error checking. The spare vector pipe can be switched in automatically whenever
a failure is detected. This allows on-line repair of the failing unit.
With an 8-ns clock rate, the CDC/NASF can operate at a rate of 500
megaflops for 64-bit results and 1000 megaflops for 32-bit results. Since each result
may be produced with one to three floating-point operations, depending on
whether real or complex operands are involved, the theoretical peak performance
of the NASF could be as high as 1500 megaflops for 64-bit results and 3000
megaflops for 32-bit results. Besides CDC, Burroughs has also submitted a proposal
to build the NASF as an SIMD array processor.
Figure 4.32 The proposed CDC NASF supercomputer: (a) the overall structure with main memory, a memory-hierarchy backup store, a memory interchange unit, a vector streaming unit, a scalar processor, I/O channels, and the vector pipelines (one spare); (b) one vector pipeline, with an input interface/buffer, front-end adders, a divide table, a multiplexing interface, duplicated multipliers, back-end adders, and an output interface/delay. (Courtesy of Control Data Corp.)
Figure 4.33 The FACOM vector processor VP-200: a 256-Mbyte main storage, 64K bytes of vector registers and 1K bytes of mask registers, add/logical, multiply, and divide pipes in the vector processor, and a scalar execution unit with general-purpose and floating-point registers (GPR, FLPR) connected to the channels. (Courtesy of Fujitsu Limited, Japan.)
Each load-store pipe has a data bandwidth of 267 M words per second in either
direction. This rate matches the maximum throughput of the arithmetic pipes.
There are four execution pipelines in the vector processor. Data formats for vector
instructions can be bit strings, 32-bit fixed-point, and 32- or 64-bit floating-point
operands. There are 83 vector instructions and 195 scalar instructions in the
VP-200. Most of the scalar instructions are IBM 370 compatible, and the scalar
unit interfaces with the main memory via a large buffer storage.
In the VP-200, the vector registers can be dynamically reconfigured by con-
catenation to assume variable lengths up to 1024 words. Vector instructions include
vector compare, masking, compress, expand, macros, and controls, in addition to
arithmetic and logical operations. Concurrent operations include two load-stores,
a mask operation, two out of the three arithmetic pipes, and scalar operations. The
throughput of the add-logical pipe and the multiply pipe is 267 megaflops each,
whereas that of the divide pipe is 38 megaflops. Hence 533 megaflops is the maximum
throughput, achieved when the add-logical and the multiply pipes run concurrently.
The VP-200 has advanced optimization facilities to generate efficient object code
through vectorization of sequential constructs, pipeline parallelization, vector-
register allocation, and code generation optimization.
The strength of the VP-200 lies in its impressive throughput and easy pro-
gramming environment. The Fortran 77 compiler was developed with advanced
automatic optimization features, convenient tuning tools, and an application
library. The system can be a simple add-on to a front-end processor with MVS/OS
as a loosely coupled multiprocessor. It can utilize many of the existing software
assets. With more reliable circuit and packaging technology, the system can self-
recover from hardware errors. The Fortran 77/VP components in the VP-200
include language processors for both scalar and vector objects, tuning tools,
debugging tools, and subroutine packages for special scientific computations, such
as solving linear systems, eigenvalues, differential equations, fast Fourier
transforms, etc.
Although high-speed components, a high degree of parallelism and/or pipelining,
and a large main memory are the basic requirements to stretch the computa-
tional capabilities of a supercomputer, the VP-200 designers consider the following
features equally important for such a machine to be versatile enough for a wide
range of applications: an interactive feature, for example, by which a programmer
can provide the compiler with useful information for higher vectorization.
Vector editing functions The FACOM vector processor provides two types of
editing functions: compress-expand operations and vector indirect addressing.
These functions can be used not only for conditional vector operations but also for
sparse matrix computations and other data editing applications. Vectors in vector
registers can be edited by the compress and expand functions by using the load-store
pipes as data alignment circuits; no access to the main memory is involved in these
cases. Compressing a vector A means that the elements of A marked with 1s in the
corresponding locations of the mask vector are copied into another vector B,
where these elements are stored in contiguous locations with their order preserved.
Expanding a vector means the opposite operation, as the example below shows:
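A minimal Python illustration with assumed data (not the machine's instruction format) shows the two editing functions acting as inverses of one another under the same mask:

```python
# COMPRESS gathers the elements of A marked with 1s in the mask into
# contiguous positions; EXPAND scatters a packed vector back into the
# masked positions (illustrative sketch, data values assumed).

def compress(a, mask):
    return [x for x, m in zip(a, mask) if m == 1]

def expand(packed, mask, fill=0.0):
    out, it = [], iter(packed)
    for m in mask:
        out.append(next(it) if m == 1 else fill)
    return out

if __name__ == "__main__":
    a    = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    mask = [1, 0, 1, 1, 0, 1]
    packed = compress(a, mask)        # [1.0, 3.0, 4.0, 6.0]
    print(packed)
    print(expand(packed, mask))       # [1.0, 0.0, 3.0, 4.0, 0.0, 6.0]
```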
In vector indirect addressing, a vector J held in vector registers contains the indices
of the elements of another vector A stored in the main memory, which is to be
loaded into a vector C defined in vector registers. Namely, C(I) = A(J(I)). This is a
very versatile and powerful operation, since the order of the elements can be
scrambled in any manner. The data transfer rate for vector indirect addressing,
however, is lower than that for contiguous vectors, due to possible bank and/or
bus conflicts.
Example 4.10 This example describes vector indirect addressing in the VP-200.

1 1 0 1 0 0 1 1           Mask vector
1 2 4 7 8                 List vector generation
A1 A2 A4 A7 A8            Indirect vector load of A
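The same gather operation is easy to express in software. The following Python sketch (illustrative only) generates the list vector from the mask and then performs the indirect load C(I) = A(J(I)) with Fortran-style 1-based indices:

```python
# Vector indirect addressing (gather), following Example 4.10.

def list_vector(mask):
    """1-based indices of the 1 bits in the mask vector."""
    return [i + 1 for i, m in enumerate(mask) if m == 1]

def gather(a, j):
    """C(I) = A(J(I)), with 1-based indices as in Fortran."""
    return [a[idx - 1] for idx in j]

if __name__ == "__main__":
    mask = [1, 1, 0, 1, 0, 0, 1, 1]
    a = ["A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8"]
    j = list_vector(mask)       # [1, 2, 4, 7, 8]
    print(gather(a, j))         # ['A1', 'A2', 'A4', 'A7', 'A8']
```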
Vector register optimization One of the most unusual features of the FACOM
vector processor is its dynamically configurable vector registers. The concept of
the vector register is very important for efficient vector processing, since it drastically
reduces the frequency of accesses to the main memory. The results of our study
indicate that the requirements for the length and the number of vectors vary from
one program to another. To make the best utilization of the total size of 64K bytes,
the vector registers can be concatenated to take the following configurations:
32 (length) x 256 (number of vectors), 64 x 128, 128 x 64, ..., 1024 x 8. The length
of the vector registers is specified by a special hardware register, and it can be altered
by an instruction in the program.
The compiler must know the most frequently used hardware vector length for each
program; even within one program the vector length may have to be adjusted.
When the vector length is too short, load-store instructions will be issued more
frequently, whereas if it is unnecessarily long, the number of available vectors will
be small and vector registers will be wasted. As a general strategy, the compiler puts
a higher priority on the number of vectors in determining the register configuration.
A programmer can also interactively provide the compiler with information on
the vector length.
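The configurations quoted above all partition the same register file. The short Python sketch below (assuming 8-byte vector elements, so 64K bytes hold 8192 components) enumerates the length-count pairs; each pair keeps length times count constant:

```python
# Enumerate the vector register configurations of the 64K-byte register file,
# assuming 8-byte components (so 8192 components in total).

TOTAL_COMPONENTS = 64 * 1024 // 8

length = 32
while length <= 1024:
    print(f"{length:5d} (length) x {TOTAL_COMPONENTS // length:4d} (vectors)")
    length *= 2
# prints 32 x 256, 64 x 128, 128 x 64, ..., 1024 x 8
```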
Figure 4.34 Concurrent processing of vector and scalar instructions in the VP-200, showing vector and scalar instructions occupying the pipeline resources over time. (Courtesy of Fujitsu Limited, Japan.)
The compiler performs the extensive data-flow analysis of the Fortran source
programs and schedules the instruction stream, so that the vector arithmetic
pipeline units are kept as busy as possible. This process includes the reordering of
instruction sequence, balanced assignments of two load-store pipes, and insertion
of serialization instructions, wherever necessary.
A comparison of the modern pipeline vector supercomputers that we have
studied is given in Table 4.14. This table summarizes the instruction repertoire,
basic system specifications, functional pipes, vector registers, main memory, peak
CPU speed, vectorizing facilities, front-end host computers, and possible future
extensions for the Cray-1, the Cyber-205, and the VP-200. The option of one vector
processor in the Cyber-205 is assumed in the comparison. With the introduction
of the Cray X-MP and the options of having two or four vector processors in a
Cyber-205, one can conclude that the three vector supercomputers have essentially
the same computing power. It is interesting to watch for their future upgraded
models.
e1 : e2 : e3        (4.3)

where e1, e2, and e3 are expressions of indexing parameters as they appear in a
DO statement: e1 indicates the first element or the initial index value, e2 indicates
the terminal index value, and e3 is the index increment or skip distance. If e3 is
omitted, the increment is one; this includes all of the elements from e1 to e2. The
single symbol "*" indicates that all of the elements in a particular dimension are
included. If the elements are to be used in reverse order, the notation "-*" may be used.
In a binary vector operation, the two operands must have equal length, with
only a few exceptions. Each vector operation may be associated with a logical
array that serves as a control vector. A WHERE statement allows the pro-
grammer to indicate the assignment statements to be executed under the control
of a logical array. The following PACK and UNPACK operations demonstrate
the use of control vectors.

Example 4.13 Given: DIMENSION A(6), B(6), C(6); DATA A/-3, -2, 1, 3,
-2, 5/
Then: PACK WHERE (A .GT. 0) B=C causes the elements of C in positions
corresponding to "trues" in A .GT. 0 to be assigned to B elements such
that B(1) = C(3), B(2) = C(4), B(3) = C(6);
An intrinsic function needs to compute with each element of a vector operand.
For example, A(1:10) = SIN(B(1:10)) is a vector intrinsic function. Several special
vector statements are shown in the following examples. The loop

DO 20 I=8,120,2
20 A(I)=B(I+3)+C(I+1)

is converted into a single vector statement:

A(8:120:2)=B(11:123:2)+C(9:121:2)        (4.4)
Likewise, the loop

A(0)=X
DO 20 I=1,N
20 A(I)=A(I-1)*B(I)+C(I+1)

is converted to:

A(0)=X
A(1:N)=A(0:N-1)*B(1:N)+C(2:N+1)        (4.5)
The conditional loop

DO 20 I=1,N
20 IF(L(I).NE.0) A(I)=A(I)-1

is converted to the masked vector statement

WHERE(L(1:N).NE.0) A(1:N)=A(1:N)-1        (4.6)

Statement reordering converts the loop

DO 20 I=1,N
A(I)=B(I-1)
20 B(I)=2*B(I)

to the following code:

B(1:N)=2*B(1:N)
A(1:N)=B(0:N-1)        (4.7)
Example 4.19 Temporary storage can be used to enable parallel computa-
tions, such as converting the statements

DO 20 I=1,N
A(I)=B(I)+C(I)
20 B(I)=2*A(I+1)

to the vector code

TEMP(1:N)=A(2:N+1)
A(1:N)=B(1:N)+C(1:N)        (4.8)
B(1:N)=2*TEMP(1:N)
Example 4.20 Prolonging the vector length is always desirable for pipeline
processing. The two levels of array computations

DO 20 I=1,80
DO 20 J=1,10
20 A(I,J)=B(I,J)+C(I,J)

can be rearranged to promote better pipelining:

DO 20 J=1,10
DO 20 I=1,80
20 A(I,J)=B(I,J)+C(I,J)
Other techniques such as register allocation, vector hazard detection, and instruction
rearrangement are also machine dependent. For example, we want to allocate the
vector registers in the Cray-1 so as to result in minimal execution time. Rearranging
the execution sequence to execute the same kind of vector operation repeatedly can
reduce the pipeline reconfiguration overhead in a multifunction pipe. A vectorizer
informs the programmer of the possibility of parallel operations. It also provides
a learning tool, in that the programmer can examine the output of the vectorizer
and tune the computations for better pipelining. Automatic vectorization and code
optimization will increase programming productivity on vector processors.
For example, the vector statement A = A + B * C may be compiled into

T1 <- B * C
A  <- A + T1

where T1 is a compiler-generated temporary identifier.
Figure 4.35 Vectorization ratio and saving in execution time. (Courtesy of Fujitsu Limited, Japan.)
Achieving a high vectorization ratio demands a compiler which can efficiently
access complicated data structures and tune branch-disturbed program structures.
The compiler requires sophisticated optimization techniques and improved hardware
architecture. The major optimizing functions are listed below according to the level
of sophistication in generating efficient scalar and vector code modules.
1. General optimization
(a) Common expression elimination
(b) Invariant expression movement
(c) Strength reduction
(d) Register optimization
(e) Constant folding
2. Extended optimization
(a) Intrinsic function integration, such as SQRT, SINE, etc.
(b) User Fortran subprogram integration
(c) Reduction of iteration counts in nested DO loops
(d) Reordering of the execution sequence to reduce pipeline overhead
(e) Temporary storage management
(f) Code avoidance
3. Vector extended optimization
(a) Full vectorization
(b) Pipeline chaining
(c) Pipeline antichaining
(d) Vector register optimization
(e) Parallelization
Figure 4.36 Vectorizing compiler design techniques in the Fujitsu FACOM VP-200. (Courtesy of Fujitsu Limited, Japan.)
(A) Redundant expression elimination After the scan and parse phase, some
redundant operations in the intermediate code can be eliminated. The number
of memory accesses and the execution time are reduced by eliminating the
redundant expressions.
For example, the intermediate code on the left below computes the same product
twice; eliminating the redundant expression yields the shorter sequence on the right:

V1 <- A(1:99:2)          V1 <- A(1:99:2)
V2 <- B(1:50)            V2 <- B(1:50)
V3 <- V1 * V2            V1 <- V1 * V2        (4.9)
V4 <- A(1:99:2)          V1 <- 2 * V1
V5 <- B(1:50)            C(1:50) <- V1
V6 <- V4 * V5
V7 <- V3 + V6
C(1:50) <- V7
(B) Constant folding at compile time In its full generality, constant folding means
shifting computations from run time to compile time. Although the opportunities
to perform operations on constant arrays are not frequent, such opportunities
crop up occasionally, particularly as initialized tables. For example, a DO loop
for generating the array A(I) = I for I = 1, 2, ..., 100 can be avoided in the execu-
tion phase, because the array A(I) can be generated as a constant vector
(1, 2, 3, ..., 100) initialized at compile time.
(C) Invariant expression movement The innermost loops are often just scalar
encodings of vector operations. In such a case, the innermost loop can be replaced
by some vector operations if there is no data dependence relation or loop
variance. If some of the statements are loop-variant, the loop-invariant expressions
may be moved out of the loop. This is called code motion. Consider the following
two-level DO loop:

DO 20 I=1,N
DO 20 J=1,M
B(I,J)=B(I,J)+A(J)*C(J)
20 CONTINUE
The vector multiplication can be moved out of the loop using a temporary
vector T(*):

T(*)=A(*)*C(*)
DO 20 I=1,N
B(I,*)=B(I,*)+T(*)
20 CONTINUE
Figure 4.37 Pipeline chaining and parallelization for linked vector operations: the two load/store pipes stream in B, C, D, E, F, and G and store A, while the multiply pipe and the add pipe are chained onto the incoming operand streams. (Courtesy of Fujitsu Limited, Japan.)
Example 4.22 Based on the hardware architecture of the VP-200 in Figure 4.33,
we can implement the following vector operations, as demonstrated in
Figure 4.37:

A(I)=B(I)*C(I)+D(I)*E(I)+F(I)*G(I)   for I = 1, 2, ..., N        (4.10)

Three pairs of vector load operations, (B, C), (D, E), and (F, G), are done in
the two load-store pipelines. Three vector multiply operations, B * C, D * E, and
F * G, are carried out in the multiply pipeline from time t1. The two vector
add operations are executed in the add pipeline from time t2. The first result
becomes available at t3. With a minor pipeline reconfiguration delay, the
store operation begins at t4. The entire operation requires roughly 4N plus the
accumulated pipeline delays, where t1 - t0 is the delay of one pipeline. It was
assumed that all pipes have equal delays.
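The overlap can be seen from a rough timing sketch. The Python fragment below (illustrative only, with an assumed uniform pipe delay d measured in clock periods) lists the busy intervals of the load/store, multiply, and add pipes and shows a total close to 4N clock periods:

```python
# Rough busy intervals for the linked operation of Example 4.22,
# assuming chained pipes with a uniform delay d (all values illustrative).

def schedule(n, d=4):
    t = {}
    # load/store pipe 1 streams B, D, F back to back; pipe 2 streams C, E, G
    for k, pair in enumerate(("B,C", "D,E", "F,G")):
        t[f"load {pair}"] = (k * n, k * n + n)
    # each multiply starts one pipe delay after its operand streams begin
    for k, prod in enumerate(("B*C", "D*E", "F*G")):
        t[f"mult {prod}"] = (k * n + d, k * n + n + d)
    # the adds chain onto the multiplies; the store waits for pipe 1 to be free
    t["adds"] = (2 * d, 3 * n + 2 * d)
    t["store A"] = (3 * n + 3 * d, 4 * n + 3 * d)
    return t

if __name__ == "__main__":
    for op, (start, end) in schedule(64).items():
        print(f"{op:10s} {start:4d} .. {end:4d}")
```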
(E) Vector register allocation On a machine like the Cray-1, which allows the chaining
of vector operations, the importance of register allocation cannot be overstated.
To achieve the maximum computing power, the execution units must be fed a
continuous stream of operands. Retaining vectors in registers between operations
is one way to achieve this. However, the strategy of vector register allocation
emphasizes local allocation rather than global allocation, because of the limited
number of vector registers available in a system.
For example, the computation A <- A - B * C can be coded with three vector registers:

V1 <- A
V2 <- B
V3 <- C
V2 <- V2 * V3
V1 <- V1 - V2
If we do the multiply before loading the vector A into a register, only two
vector registers are required:

V1 <- B
V2 <- C
V1 <- V1 * V2
V2 <- A
V2 <- V2 - V1
Consider the two execution sequences below, which contain the same add and
multiply instructions:

A1 <- A2 + A3        A1 <- A2 + A3
A4 <- A5 * A6        B1 <- A7 + A8
B1 <- A7 + A8        B5 <- C1 + C2        (4.11)
B2 <- B3 * B4        A4 <- A5 * A6
B5 <- C1 + C2        B2 <- B3 * B4

The sequence on the left requires three pipeline reconfigurations, while the one on
the right requires only one. An intelligent compiler should be able to reorder the
execution sequence to minimize the number of required pipeline reconfigurations.
Consider the vector statements

A(1:50)=B(1:99:2)
C(1:50)=2.0*A(1:50)

A copy is avoided (or at least delayed) if the compiler adjusts the storage-mapping
function for array A to reference the storage for array B. Thus, instead of producing
the following code:

V1 <- B(1:99:2)
A(1:50) <- V1
V1 <- A(1:50)
V1 <- 2.0 * V1
C(1:50) <- V1

the compiler can generate:

V1 <- B(1:99:2)
V1 <- 2.0 * V1
C(1:50) <- V1
(H) Tuning for interactive vectorization Tuning tools are necessary to provide some
user interaction in advanced vectorization. Both the Cray-1 and the VP-200 have
tuning facilities. The tuning facility in the VP-200 is illustrated in Figure
4.38. From the displayed tuning information and vectorizing effects, the user can
modify the source program with the help of an interactive vectorizer package. The
modified source program will be optimized towards full vectorization by the
vectorizing compiler. A number of compiler directive lines are useful for checking
whether recurrences appear in the source code, the true ratio in IF statements,
the vector length distribution, and other properties. The vectorizing compiler
generates the
Figure 4.38 The tuning facilities in the FACOM VP-200, in which the user's source program and tuning statements feed an interactive vectorizer and the FORTRAN77/VP vectorizing compiler. (Courtesy of Fujitsu Limited, Japan.)
optimal object code with vector instructions after the tuning process is completed.
An example is given below to illustrate the tuning concept realized in the VP-200.

DO 20 I=1,N
IF(A(I).GE.10) GO TO 20
T=C1*A(I)+C2*B(I)
A(I)=SIN(T)
20 CONTINUE

If the user indicates that the true ratio of the IF condition is 90 percent, the
compiler chooses vector indirect addressing, which results in several times better
performance than the case without tuning information. The compiler chooses
masked arithmetic if the optimization information is not available.
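The two strategies differ only in how much work is spent on the masked-off elements. The Python sketch below (plain code with assumed cost counting, not VP-200 output) contrasts masked arithmetic, which computes every element and keeps only the unmasked results, with the gather style, which builds a list vector first and processes only the qualifying elements:

```python
# Masked arithmetic versus gather for the conditional loop above
# (illustrative comparison; costs counted as processed elements).

import math

def masked(a, b, c1, c2):
    out, work = list(a), 0
    for i in range(len(a)):
        t = c1 * a[i] + c2 * b[i]        # computed for every element
        work += 1
        if a[i] < 10.0:                  # mask decides whether the result is kept
            out[i] = math.sin(t)
    return out, work

def gathered(a, b, c1, c2):
    out = list(a)
    idx = [i for i in range(len(a)) if a[i] < 10.0]   # list vector
    for i in idx:                        # only the selected elements are processed
        out[i] = math.sin(c1 * a[i] + c2 * b[i])
    return out, len(idx)

if __name__ == "__main__":
    a = [float(i % 20) for i in range(1000)]
    b = [1.0] * 1000
    r1, w1 = masked(a, b, 0.5, 0.25)
    r2, w2 = gathered(a, b, 0.5, 0.25)
    print(r1 == r2, w1, w2)   # same results; fewer operations when few elements qualify
```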
The overall performance of a pipeline computer depends on the pipeline rate, the
workload distribution, the vectorization ratio, and the utilization rate of the system
resources.
The system performance of a vector processor is measured by the maximum
throughput W, that is, the maximum number of results that can be generated per
unit time, such as the megaflops measure we have used. The processing of a vector
job in a pipeline occupies the equipment (segments) over a certain length of time.
The enclosed area in a space-time diagram depicts the pipeline hardware utilization
as a function of time. The segments of a pipeline may operate on distinct data
operands simultaneously. Pipelining increases the bandwidth by a factor of k, since
the pipe may carry k independent operand sets in its k segments concurrently. A
pipeline computer requires more hardware and more complex control circuitry
than a corresponding serial computer. The following notation is used in the
performance analysis.
The time required to execute a single vector instruction of length N_i on a k-stage
pipeline with clock period t is T_i = (k + N_i - 1)*t, where k*t is the time required
to fill up the pipe. As N_i becomes long, T_i approaches N_i*t, because k is usually
a small integer. The system throughput W = N_i/T_i then approaches 1/t accordingly.
The above derivation is for a single vector instruction with vector length N_i.
Consider a sequence of n vector instructions. The degree of parallelism in each
vector instruction is represented by its vector length N_i, for i = 1, 2, ..., n. Suppose
that the execution of different types of vector instructions takes the same amount
of time if they have the same vector length. The total execution time required in
a pipeline processor is then equal to
T = [(k - 1)*n + sum N_i]*t

where the sum runs over the n vector instructions. A nonpipelined processor needs
k*t per result, or k*t*(sum N_i) in total, so the speedup is

S_k = k*(sum N_i) / [(k - 1)*n + sum N_i]        (4.14)

and the pipeline efficiency is

eta = S_k / k = (sum N_i) / [(k - 1)*n + sum N_i]        (4.15)
The pipeline efficiency can be interpreted as the ratio of the actual speedup to
the maximum possible speedup k. A numerical example is used to demonstrate
these analytical measures. Consider a vector job with the vector length distribution
N_i = 7, 3, 10, 1, 4, 6, 2, 5, 2, 4 for n = 10 vector instructions. Figure 4.39 plots S_k
and eta against different values of k with respect to the above distribution. When k
increases beyond the average vector length (i.e., 4.4 in this example), the
increase in speedup becomes rather flat while the processor utilization continues
to decline.
In general, pipeline processors favor long vectors. The longer the
vector fed into a pipeline, the smaller the effect of the overhead will be.
Figure 4.40 shows the relation between the vector length and the system perfor-
mance on a pipeline computer with k = 8 stages. The speedup increases mono-
tonically toward the maximum value of k = 8 as the vector length approaches
infinity. The dashed line in Figure 4.40 displays the effect of partitioning the
vector operand into 16-element segments. The maximum speedup drops to
8 x 16/(16 + 7) = 5.5 with vector looping. This occurs when the vector length is a
multiple of 16, the number of component registers in each vector register.
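The numbers quoted for this distribution are easy to reproduce. The following Python fragment (illustrative, using Eqs. 4.14 and 4.15 as reconstructed above) evaluates S_k and eta for the ten vector lengths:

```python
# Speedup S_k and efficiency eta for the vector-length distribution above.

N = [7, 3, 10, 1, 4, 6, 2, 5, 2, 4]

def speedup(k, lengths):
    total = sum(lengths)
    return k * total / ((k - 1) * len(lengths) + total)

for k in range(1, 11):
    s = speedup(k, N)
    print(f"k = {k:2d}   S_k = {s:5.2f}   eta = {s / k:5.2f}")
# Beyond the average vector length (4.4), S_k flattens while eta keeps falling.
```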
The pipeline efficiency also depends on the vector length distribution. Too
many scalar operations of different types will definitely degrade the system per-
formance. To overcome this drawback, an intelligent vectorizer can help improve
the situation. The Cray-1 has a scalar processor which is more than two times
faster than the CDC 7600. When the vector length is short, execution by the scalar
processor should be faster than execution in a vector pipeline.
The throughput rate reflects the processing capability of a pipeline processor.
A higher throughput rate may be obtained at the expense of higher hardware cost.
Therefore, the cost effectiveness of a pipeline design should not be ignored.
Cost effectiveness can be measured in megaflops per million dollars. Table 4.15
presents the performance and cost ratios of several pipeline computers. The
efficiency of a pipeline computer may depend on both the hardware cost and the
Figure 4.39 The speedup (S_k) and efficiency (eta) of a pipelined processor with k stages.
Figure 4.40 Pipeline speedup with and without vector looping (segmentation into 16-element groups). (Courtesy of Advances in Computers, Vol. 26, Hwang et al., 1981.)
[Figure 4.41: a space-time diagram of a k-stage pipeline with nonuniform stage delays; the bottleneck stage determines the clock period, and the diagram marks the productive and idle periods of each stage.]
delay of each pipeline stage. Let C_i be the cost and t_i be the delay of the ith stage
S_i in a pipeline having k stages. Let n be the total number of jobs streamed into
the pipeline during the period of measurement (assuming a continuous input of
jobs). Let t be the pipeline clock period, equal to the delay of the bottleneck stage
(the bottleneck stage in Figure 4.41). The efficiency in Eq. 4.15 can then be refined to

eta = n*(sum of C_i*t_i over the k stages) / {(sum of C_i)*[(sum of t_i) + (n - 1)*t]}        (4.16)

As illustrated in Figure 4.41, the numerator in Eq. 4.16 corresponds to the total
weighted space-time span of the n productive jobs. The denominator designates the
total weighted space-time span of all k stages, including both the productive and idle
periods of all hardware facilities. In the ideal case with uniform delay (t_i = t for
all i), this efficiency reduces to the form in Eq. 3.5:

eta = n / (k + n - 1)        (4.17)
Note that eta = S_k/k, where S_k is the speedup defined in Eq. 4.14. When the pipeline
approaches the steady state with a sufficiently long vector input, the limiting effi-
ciency becomes

lim (n -> infinity) eta = (sum of C_i*t_i) / (t * sum of C_i)        (4.18)

In the ideal case of uniform delay (t_i = t for all i), the above limit equals 1 and the
maximal speedup is achieved (S_k -> k).
The cost effectiveness is indicated by the potential throughput performance
relative to the total processor cost. The optimal pipeline design maximizes such
a performance-cost ratio. Let T be the total time required to process a job in a
nonpipelined serial processor. Consider the execution of the same job in an equiv-
alent pipeline processor with k stages. The pipeline clock period is set to t =
T/k + h, where h is the latch delay. Thus, in n*t = n*(T/k + h) time units, n results
can be produced. This implies the following system throughput:

W = n / [n*(T/k + h)] = 1 / (T/k + h)        (4.19)

Let C be the total cost of all k pipeline stages and d be the average
latch cost. The cost of the entire pipeline is then C + k*d. A performance-cost
ratio (PCR) for the pipeline processor is defined as

PCR = W / (C + k*d) = 1 / [(T/k + h)*(C + k*d)]        (4.20)

The optimal design of a static linear pipeline processor requires k0 stages such
that the PCR is maximized. Differentiating the PCR with respect to k and setting
the first-order derivative to zero,

d(PCR)/dk = 0        (4.21)
yields the optimal number of pipeline stages

k0 = sqrt(T*C / (h*d))        (4.22)

This is indeed a maximum, because one can show that the second derivative of the
PCR with respect to k is negative when evaluated at k = k0.

Figure 4.42 The performance-cost ratio (PCR) as a function of the number of pipeline stages k, with its maximum at the optimal stage number k0.
This optimal stage number is always greater than one for any reasonably com-
plex pipeline. In the above discussion, we have emphasized local optimization of
the pipeline processor. In a practical design, one has to consider the global perfor-
mance of the entire computer system, which may include parameters of memory
access and program behavior. The PCR given in Eq. 4.20 is plotted in Figure 4.42
as a function of the number k of pipeline stages. The peak of the curve corresponds
to the optimal number, k0, of pipeline stages.
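A short numerical check (illustrative parameter values, not from the text) confirms that the analytic optimum of Eq. 4.22 coincides with the peak of the PCR curve of Eq. 4.20:

```python
# Performance/cost ratio and its optimal stage count (illustrative values).

import math

T, h = 128.0, 1.0      # nonpipelined flow-through time and latch delay
C, d = 256.0, 2.0      # total stage cost and cost per latch

def pcr(k):
    return 1.0 / ((T / k + h) * (C + k * d))

k0 = math.sqrt(T * C / (h * d))
best = max(range(1, 257), key=pcr)
print(f"analytic k0 = {k0:.1f}, best integer k = {best}")   # both give 128
```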
Comparative studies of the early vector processors Star-100 and TI-ASC were
given by Higbie (1973) and Theis (1974). The Cray-1 and Cyber-205 have been
characterized in Kozdrowicki and Theis (1980). Hockney and Jesshope (1981)
have treated pipeline vector processors and array processors. Others can be found
in Thurber (1976, 1979a, b), Kuck (1977), Chen (1980), and Kogge (1981). Litera-
ture devoted to the CDC Star-100 and TI-ASC can be found in CDC manuals,
Hintz and Tate (1972), Purcell (1974), Stone (1978), Ginsberg (1977), Watson
(1972a, b), Watson and Carr (1974), and Texas Instruments manuals.
The Cray-1 computers were studied in Johnson (1978), Peterson (1979),
Russell (1978), Cray Research manuals, Dorr (1978), and Baskett and Keller
(1976). The Cyber-205 is described in Kascic (1979). The material presented
in Section 4.4.4 is based on the work by Lincoln (1982). Manual information on
the Cyber-205 can be found in CDC manuals. Material on the VP-200 is based
on the report by Miura and Uchida (1983). The AP-120B has been described and
assessed in Wittmayer (1978) and Floating Point Systems manuals. Other attached
processors were reported in Datawest (1979), IBM (1977), and Thurber (1979b).
Vectorization methods for pipeline computers can be found in CDC and Cray
Research manuals, Paul and Wilson (1978), Kennedy (1979), Loveman (1977),
and Hwang et al. (1981). Vectorizing compilers are also studied in Arnold (1982),
Brode (1981), and Kuck et al. (1983). Parallel programming languages and the
optimization of vector operations are still wide-open areas for further research
and development. Performance of pipeline processors has been evaluated in Chen
(1975), Bovet and Vanneschi (1976), Baskett and Keller (1977), Larson and
Davidson (1973), Ramamoorthy and Li (1975), Hwang and Su (1983), and Stokes
(1977).
Problems
4.1 Design a pipeline multiplier using carry-save adders and a carry-lookahead adder to multiply a
stream of input numbers X1, X2, X3, ... by a fixed number Y. You may assume that all Xi and Y are
n-bit positive integers. The output should be a stream of 2n-bit products X1*Y, X2*Y, X3*Y, ....
Determine the pipeline clock rate in terms of the delays a, b, and c, where
4.2 Consider a static multifunction pipeline processor with k stages, each stage having a delay
of 1/k time units. The pipeline must be drained between different functions, such as addition and multi-
plication. Memory-access time, control-unit time, etc., can be ignored. There are sufficient numbers of
temporary registers available.
(a) Determine the number of unit-time steps T_s required to compute the product of two n x n
matrices on a nonpipelined scalar processor. Assume one unit time per addition or multi-
plication operation in the scalar processor.
(b) Determine the number of time steps T_p required to compute the matrix product using the
multifunction pipeline processor with a total pipeline delay of one time unit.
(c) Determine the speedup ratio T_s/T_p when n = 1, n = k, and n = m*k for some large m.
4.3 Compare the second-generation vector processors Cray-1, Cyber-205, and VP-200 in the following
aspects:
(a) Instruction sets: vector versus scalar instructions
(b) Functional pipeline structures and usage
4.6 Conduct a thorough comparison of back-end computer systems, including both vector super-
computers and attached scientific processors, in the following aspects:
(a) Peak scalar performance
(b) Peak vector performance
(c) Pipeline clock rate
(d) Memory bandwidth
(e) Cost of the central processing unit
(f) Performance/cost ratio
(g) Main memory capacity
(h) Register files/buffers
(i) Functional pipelines
4.7 Let A(1:2N), B(1:2N), C(1:2N), and D(1:2N) each be a 2N-element vector stored in the main
memory of a vector processor. Each vector register in the processor has N components. There are two
load-store pipes, one multiply pipe, and one add pipe available to be used. Draw a space-time
diagram (similar to that in Figure 4.37 for Example 4.22) to show the pipeline chaining and paralleliza-
tion operations performed in the execution of the following linked vector instructions with
minimum time delay:
Assume that all pipeline units, regardless of their functions, have equal delays, as assumed in Figure 4.37.
Sufficient numbers of vector registers are assumed available, and they can be cascaded together to hold
longer vectors of length 2N, 3N, etc.
4.8 A vector arithmetic reduction unit is shown in Figure 4.43. This multifunction pipeline can accept
vector inputs and produce a single scalar output at the end of the computation. The feedback connection
is needed for accumulative arithmetic operations. Develop four fast algorithms for scheduling the succes-
sive computations (multiplies and add-subtracts) needed in the following vector reduction arith-
metic operations:
(a) The summation of the n components of a vector
(b) The dot product of two n-element vectors
(c) The multiplication of two n x n matrices
(d) The search for the maximum among the n components of a vector
Hint: Algorithm (a) may be used in algorithm (b). Both (a) and (b) may be used in algorithm (c). In
part (d), comparison is done by subtraction through the pipeline unit.
4.9 Let D be a stream of data (tasks or operand sets). Suppose that we wish to perform two functions
f1 and f2 on every task in D. That is, for each operand set x in D, we want to compute f2(f1(x)). The
computations are to be performed on a machine with multiple pipeline functional units, one of which
computes the function f1 and another of which computes f2. The reservation tables for f1 and f2 are
given below.
[Reservation tables for the f1 and f2 pipelines, and their linear cascade connection f1 -> f2.]
(a) What are the maximum throughputs for the f1 and f2 pipelines, assuming that they work
completely independently of one another? That is, assume that the two pipelines work on completely
independent data streams D1 and D2.
(b) In chaining, the output of one pipeline is applied directly to the input of another pipeline.
One can think of this as configuring the pipelines so that the output latch or buffer of the first pipeline
becomes the input latch or buffer of the second. What is the maximum throughput for tasks in D if the
f1 and f2 pipeline functional units are chained together?
(c) What can you conclude about the general effectiveness of chaining pipelines that have feed-
back? Consider the effect on memory contention and the demand on memory bandwidth as part of
your answer.
4.10 Consider three functional pipelines f1, f2, and f3 characterized by the reservation tables in Figure
4.44a.
(a) What are the minimal average latencies in using the f1, f2, and f3 pipelines independently?
(b) What is the maximum throughput if the three pipelines are chained into a linear cascade as
shown in Figure 4.44b?
4.11 Show the timing diagrams for implementing the two sequences of vector instructions (described in
Example 4.23) on the Cray-1 machine. Verify the total clock periods required in each of the two com-
puting sequences.
CHAPTER
FIVE

STRUCTURES AND ALGORITHMS
FOR ARRAY PROCESSORS
This chapter deals with the interconnection structures and parallel algorithms for
SIMD array processors and associative processors. The various organizations and
control mechanisms of array processors are presented first. Interconnection
networks used in array processors will be characterized by their routing functions
and implementation methods. We then study the structure of associative memory
and parallel search in associative array processors. SIMD algorithms are presented
for matrix manipulation, parallel sorting, fast Fourier transform, and associative
search and retrieval operations.
[Figure 5.1: two SIMD array processor organizations. In configuration I, a control unit (CU) with its own CU memory broadcasts instructions over a control bus to processing elements PE0, PE1, ..., each with a local memory PEM, while data move over a data bus and an inter-PE interconnection network under CU control. In configuration II, the PEs share parallel memory modules through an alignment network.]
Each processing element PE_i consists of an arithmetic logic unit with
working registers and a local memory PEM_i for the storage of distributed data.
The CU also has its own main memory for the storage of programs. The system
and user programs are executed under the control of the CU. The user programs
are loaded into the CU memory from an external source. The function of the CU
is to decode all the instructions and determine where the decoded instructions
should be executed. Scalar or control-type instructions are directly executed inside
the CU. Vector instructions are broadcast to the PEs for distributed execution to
achieve spatial parallelism through duplicate arithmetic units (PEs).
All the PEs perform the same function synchronously in a lock-step fashion
under the command of the CU. Vector operands are distributed to the PEMs
before parallel execution in the array of PEs. The distributed data can be loaded
into the PEMs from an external source via the system data bus, or via the CU ina
broadcast mode using the control bus. Masking schemes are used to control the
status of each PE during the execution of a vector instruction. Each PE may be
either active or disabled during an instruction cycle. A masking vector is used to
control the status of all PEs. In other words, not all the PEs need to participate in
the execution of a vector instruction. Only enabled PEs perform computation.
Data exchanges among the PEs are done via an inter-PE communication network,
which performs all necessary data-routing and manipulation functions. This
interconnection network is under the control of the control unit.
An array processor is normally interfaced to a host computer through the
control unit. The host computer is a general-purpose machine which serves as
the “operating manager” of the entire system, consisting of the host and the pro-
cessor array. The functions of the host computer include resource management
and peripheral and I/O supervision. The control unit of the processor array
directly supervises the execution of programs, whereas the host machine performs
the executive and I/O functions with the outside world. In this sense, an array
processor can also be considered a back-end, attached computer, similar in func-
tion to those pipeline attached processors studied in Chapter 4.
Another possible way of constructing an array processor is illustrated in Figure 5.1b. This configuration II differs from configuration I in two aspects. First, the local memories attached to the PEs are replaced by parallel memory modules shared by all the PEs through an alignment network. Second, the inter-PE permutation network is replaced by the inter-PE memory-alignment network, which is again controlled by the CU. A good example of a configuration II SIMD machine is the Burroughs Scientific Processor (BSP). There are N PEs and P memory modules in configuration II. The two numbers are not necessarily equal. In fact, they have been chosen to be relatively prime. The alignment network is a path-switching network between the PEs and the parallel memories. Such an alignment network is desired to allow conflict-free access to the shared memories by as many PEs as possible.
Array processors became well publicized with the hardware-software develop-
ment of the Illiac-IV system. Since then, many SIMD machines have been con-
structed to satisfy various parallel-processing applications. The Burroughs
Parallel Element Processing Ensemble (PEPE) and the Goodyear Aerospace
Staran are two associative array processors. Extended from the Illiac-IV design are the Burroughs Scientific Processor (BSP) and the Goodyear Aerospace Massively Parallel Processor (MPP).
Formally, an SIMD computer C is characterized by the following set of parameters:

    C = (N, F, I, M)        (5.1)

where N = the number of PEs in the system. For example, the Illiac-IV has N = 64, the BSP has N = 16, and the MPP has N = 16,384.
F = a set of data-routing functions provided by the interconnection network (in Figure 5.1a) or by the alignment network (in Figure 5.1b).
I = the set of machine instructions for scalar-vector, data-routing, and network-manipulation operations.
M = the set of masking schemes, where each mask partitions the set of PEs into the two disjoint subsets of enabled PEs and disabled PEs.
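As a concrete reading of Eq. 5.1, the sketch below records the four parameters as a small data structure (Python; the class and field layout are our own illustration, and only the N values are taken from the text).

    from dataclasses import dataclass, field
    from typing import Callable, Dict, List, Set

    @dataclass
    class SIMDMachine:
        """C = (N, F, I, M) as in Eq. 5.1 (illustrative model only)."""
        N: int                                          # number of PEs
        F: Dict[str, Callable[[int], int]] = field(default_factory=dict)  # routing functions
        I: Set[str] = field(default_factory=set)        # instruction set (names only)
        M: List[List[int]] = field(default_factory=list)  # masking schemes (bit vectors)

    # Only N is drawn from the text; F, I, and M are left empty placeholders.
    illiac_iv = SIMDMachine(N=64)
    bsp       = SIMDMachine(N=16)
    mpp       = SIMDMachine(N=16384)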
This model provides a common basis for evaluating different SIMD machines.
We will characterize various data-routing functions in the next section when we
study interconnection networks for SIMD machines. The instruction sets of
important array processors will be discussed with those example SIMD computers
in Chapter 6.
In addition to regular SIMD machines, several algorithmic array processors
have been developed as back-end attachments to host machines. Among them
are the IBM 3838 and the Datawest MATP. These attached array processors
are highly pipelined for array processing. They are not SIMD machines as dis-
cussed above. The reason that these pipeline attached processors are commercially
known as “array” processors lies in the fact that they are used for processing arrays
of data. Details of the Illiac-IV, BSP, MPP, and multiple-SIMD computers using
a shared pool will be treated in Chapter 6.
[Figure 5.2 The components of a processing element PE_i, for i = 0, 1, 2, ..., N − 1: the working registers A_i, B_i, C_i, D_i, the routing register R_i, the status flag S_i, the local memory PEM_i, and the connections to the CU and to other PEs via the interconnection network.]
inputs and outputs of R_i are totally isolated by using master-slave flip-flops. Each PE_i is either in the active or in the inactive mode during each instruction cycle. If a PE_i is active, it executes the instruction broadcast to it by the CU. If a PE_i is inactive, it will not execute the instructions broadcast to it. The masking schemes are used to specify the status flag S_i of PE_i. The convention S_i = 1 is chosen for an active PE_i and S_i = 0 for an inactive PE_i. In the CU, there is a global index register I and a masking register M. The M register has N bits. The ith bit of M will be denoted as M_i. The collection of S_i flags for i = 0, 1, 2, ..., N − 1 forms a status register S for all the PEs. Note that the bit patterns in registers M and S are exchangeable upon the control of the CU when masking is to be set.
Example 5.2 To illustrate the necessity of data routing in an array processor, we show the execution details of the following vector instruction in an array of N PEs. The sum S(k) of the components A_0 through A_k of a vector A is desired for each k from 0 to n − 1. Let A = (A_0, A_1, ..., A_{n−1}). We need to compute the following summations:

    S(k) = Σ_{i=0}^{k} A_i,   for k = 0, 1, ..., n − 1        (5.3)
STRUCTURES AND ALGORITHMS FOR ARRAY PROCESSORS 331
PE, Ay 0 9 0 50)
Y
PE A, 0,1 6,1 0,1 | Sd)
1
PE, A, 1,2 0-2 0-2 5(2}
Y
PE, Ay 2,3 0-3 0-3 5G)
Y
PE Ay 3,4 14 0-4 | St4)
y
PE, A, 4.5 2-5 0-5 5(5}
f
PE Ag 5,6 3-6 Oe | 6}
R_{i+2} for i = 0 to 5. In the final step, the intermediate sums in R_i are routed to R_{i+4} for i = 0 to 3. Consequently, PE_k has the final value of S(k) for k = 0, 1, 2, ..., 7, as shown by the last column in Figure 5.3.
As far as the data-routing operations are concerned, PE_7 is not involved (receiving but not transmitting) in step 1. PE_6 and PE_7 are not involved in step 2. Also PE_4, PE_5, PE_6, and PE_7 are not involved in step 3. These unwanted PEs are masked off during the corresponding steps. During the addition operations, PE_0 is disabled in step 1; PE_0 and PE_1 are made inactive in step 2; and PE_0, PE_1, PE_2, and PE_3 are masked off in step 3. The PEs that are masked off in each step depend on the operation (data-routing or arithmetic addition) to be performed. Therefore, the masking patterns keep changing in the different operation cycles, as demonstrated by the example. Note that the masking and routing operations will be much more complicated when the vector length n > N.
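The route-and-add pattern of Example 5.2 is ordinary recursive doubling, and it can be simulated directly. The sketch below (Python, illustrative only; list element i stands in for register R_i of PE_i) reproduces the log2 N routing steps, with masking modeled as "only PEs whose index is at least the routing distance perform the add."

    def simd_prefix_sums(A):
        """Compute S(k) = A[0] + ... + A[k] with log2(N) route-and-add steps."""
        R = list(A)                      # step 0: R_i = A_i
        N = len(R)
        d = 1
        while d < N:
            routed = R[:]                # register contents before this routing step
            for i in range(d, N):        # PE_i receives from PE_{i-d}; others are masked off
                R[i] = R[i] + routed[i - d]
            d *= 2
        return R                         # R_k now holds S(k)

    assert simd_prefix_sums([1, 2, 3, 4, 5, 6, 7, 8]) == [1, 3, 6, 10, 15, 21, 28, 36]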
[Figure 5.5 Conceptual view of a single-stage interconnection network: each PE 0 through N − 1 sends data through an input selector IS_i and receives data through an output selector OS_i, with direct connections between PEs recirculated through the network.]
[Figure 5.6 A two-by-two switching box and its four interconnection states: straight, exchange, upper broadcast, and lower broadcast.]
[Figure 5.7 Several multistage interconnection networks, including (c) a Clos network built from three stages of crossbar switch modules.]
pass any of multiple mappings of inputs onto outputs. The crossbar switch network
can connect every input port to a free output port without blocking.
Generally, a multistage network consists of n stages where N = 2” is the num-
ber of input and output lines. Therefore, each stage may use N/2 switch boxes.
The interconnection patterns from stage to stage determine the network topology.
Each stage is connected to the next stage by at least N paths. The network delay is
proportional to the number n of stages in the network. The cost of a size-N multistage
network is proportional to N log2 N. The control structure of a network deter-
mines how the states of the switch boxes will be set. Two types of control structures
are used in a network construction. The individual stage control uses the same
control signal to set all switch boxes in the same stage. In other words, all boxes at the same stage must be set to the same state. Therefore, it requires n sets of control signals to set up the states of all stages of switch boxes.
Another control philosophy is to apply individual box control. A separate control signal is used to set the state of each switch box. This offers higher flexibility in setting up the connecting paths, but requires n·N/2 control signals, which significantly increases the complexity of the control circuitry. A compromise design is to use partial stage control, in which i + 1 control signals are used at stage i for 0 ≤ i ≤ n − 1. Various network topologies and control structures of both recirculating and multistage inter-PE communication networks are described in subsequent sections.
5.2.2 Mesh-Connected Illiac Network
A single-stage recirculating network has been implemented in the Illiac-IV array processor with N = 64 PEs. Each PE_i is allowed to send data to any one of PE_{i+1}, PE_{i−1}, PE_{i+r}, and PE_{i−r}, where r = √N (for the Illiac-IV, r = √64 = 8), in one circulation step through the network. Formally, the Illiac network is characterized by the following four routing functions:

    R_{+1}(i) = (i + 1) mod N
    R_{−1}(i) = (i − 1) mod N
    R_{+r}(i) = (i + r) mod N        (5.5)
    R_{−r}(i) = (i − r) mod N

where 0 ≤ i ≤ N − 1. In practice, N is commonly a perfect square, such as N = 64 and r = 8 in the Illiac-IV network.
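A direct transcription of Eq. 5.5 (Python, for illustration; the function name and returned labels are our own):

    def illiac_routes(i, N):
        """Return the four PEs reachable from PE_i in one step (Eq. 5.5), with r = sqrt(N)."""
        r = int(N ** 0.5)
        assert r * r == N, "N must be a perfect square"
        return {
            "R+1": (i + 1) % N,
            "R-1": (i - 1) % N,
            "R+r": (i + r) % N,
            "R-r": (i - r) % N,
        }

    # For the reduced network of Figure 5.8 (N = 16, r = 4):
    print(illiac_routes(0, 16))   # {'R+1': 1, 'R-1': 15, 'R+r': 4, 'R-r': 12}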
A reduced Illiac network is illustrated in Figure 5.8a for N = 16 and r = 4. The real Illiac network has a similar structure, except larger in size. All the index arithmetic in Eq. 5.5 is modulo N. Comparing with the formal model shown in Figure 5.5, we observe that the outputs of IS_i are connected to the inputs of OS_j for j = i + 1, i − 1, i + r, and i − r. On the other hand, OS_j gets its inputs from IS_i for i = j − 1, j + 1, j − r, and j + r, respectively.
Each PE_i in Figure 5.8 is directly connected to its four nearest neighbors in the mesh network. In terms of permutation cycles, we can express the above routing functions as follows: Horizontally, all the PEs of all rows form a linear circular list as governed by the following two permutations, each with a single cycle of order N. The permutation cycles (a b c)(d e) stand for the permutations a → b, b → c, c → a and d → e, e → d in a circular fashion within each pair of parentheses:

    R_{+1} = (0 1 2 ··· N − 1)
    R_{−1} = (N − 1 ··· 2 1 0)        (5.6)
It should be noted that when either the R_{+1} or R_{−1} routing function is executed, data are routed as described in Eq. 5.6 only if all PEs in the cycle are active. When the routing function R_{+r} or R_{−r} is executed, data are permuted as described in Eq. 5.7 only if PE_{i+kr}, where 0 ≤ k ≤ r − 1, are active for each i. The shifting operation in a cycle will be suspended if any PE required in the cycle is disabled. For example, the cycle (1 5 9 13) in the above permutation R_{+r} will not be executed if one or more among PE_1, PE_5, PE_9, and PE_13 is disabled by masking.
The Illiac network is only a partially connected network. Figure 5.8b shows the connectivity of the example Illiac network with N = 16. This graph shows that four PEs can be reached from any PE in one step, seven more PEs in two steps, and the remaining four PEs in three steps. In general, it takes I steps (recirculations) to route data from PE_i to any other PE_j in an Illiac network of size N, where I is upper-bounded by

    I ≤ √N − 1        (5.8)
Without loss of generality, we illustrate the case where PE_0 is the source node in Figure 5.8. PE_1, PE_4, PE_12, or PE_15 is reachable in one step from PE_0. In two steps, the network can route data from PE_0 to PE_2, PE_3, PE_5, PE_8, PE_11, PE_13, or PE_14. In the worst case of three routing steps, the following eight routing sequences take place in the network:

[Eight three-step routing sequences from PE_0, reaching the remaining PEs PE_6, PE_7, PE_9, and PE_10.]
In the Illiac-IV computer, at most seven (√64 − 1) steps are needed to route data from any one PE to another PE. Of course, if we increase the connectivity in Figure 5.8, the upper bound given in Eq. 5.8 can be lowered. We shall demonstrate this with other network types in subsequent sections. When the network is strongly connected (i.e., with 15 outgoing links per node in Figure 5.8), the upper bound on recirculation steps can be reduced to one, at the expense of the significantly increased hardware of a crossbar network.
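The step counts quoted above, and the upper bound of Eq. 5.8, can be checked mechanically by a breadth-first search over the single-step routes of Eq. 5.5. A Python sketch (illustrative only):

    from collections import deque

    def illiac_steps_from(src, N):
        """Minimum number of recirculation steps from PE_src to every other PE."""
        r = int(N ** 0.5)
        dist = {src: 0}
        queue = deque([src])
        while queue:
            i = queue.popleft()
            for j in ((i + 1) % N, (i - 1) % N, (i + r) % N, (i - r) % N):
                if j not in dist:
                    dist[j] = dist[i] + 1
                    queue.append(j)
        return dist

    steps = illiac_steps_from(0, 16)
    assert max(steps.values()) == 3                            # Eq. 5.8: sqrt(N) - 1 steps
    assert sum(1 for d in steps.values() if d == 1) == 4       # four PEs reached in one step
    assert sum(1 for d in steps.values() if d == 2) == 7       # seven more in two steps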
[Figure: a three-dimensional cube with its eight vertices labeled 000 through 111.]

Vertical lines connect vertices that differ in the middle bit position. Horizontal lines differ in the least significant bit position.
This unit-cube concept can be extended to an n-dimensional unit space, called an n-cube, with n bits per vertex. A cube network for an SIMD machine with N PEs corresponds to an n-cube where n = log2 N. We shall use the binary sequence A = a_{n−1} ··· a_1 a_0 to represent the vertex (PE) address for 0 ≤ A ≤ N − 1. The complement of bit a_i will be denoted as ā_i, for any 0 ≤ i ≤ n − 1. Formally, an n-dimensional cube network of N PEs is specified by the following n routing functions:
    C_i(a_{n−1} ··· a_{i+1} a_i a_{i−1} ··· a_0) = a_{n−1} ··· a_{i+1} ā_i a_{i−1} ··· a_0,   for i = 0, 1, ..., n − 1        (5.9)

[Figure: the cube-routing functions C_0, C_1, and C_2 among eight PEs (N = 8), each exchanging data between PEs whose addresses differ in exactly one bit.]
Executed on a recirculating cube network, each routing function C_i corresponds to a permutation P_i consisting of the transpositions (j, C_i(j)), where the ith bit of j equals zero and PE_j and PE_{C_i(j)} are both active. For example, the routing function C_2 executed on a 3-cube network corresponds to the following permutation over eight PEs:

    P_2 = (0 4)(1 5)(2 6)(3 7)

If all the switch boxes in stage i are set to exchange, the network performs the P_i permutation at stage i. In general, the following multistage permutation is conducted in an n-stage cube network:

    P = ∏_{i=0}^{n−1} ∏_{j=0}^{N−1} (j  C_i(j))        (5.11)

where the ith bit of j equals 0 and the stage-i switch boxes whose inputs are labeled j and C_i(j) are set to exchange. For the example design in Figure 5.10, the permutation (0 1)(0 2)(0 4) = (0 1 2 4) is performed only if the top-row boxes are set to exchange and the rest are set to straight.
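A short Python rendering of the cube-routing function of Eq. 5.9 and of the stage permutation just described (illustrative; the function names are our own):

    def cube(i, A):
        """C_i(A): complement bit i of the PE address A (Eq. 5.9)."""
        return A ^ (1 << i)

    def cube_cycles(i, n):
        """Transpositions performed when all stage-i boxes are set to exchange (n-cube)."""
        N = 1 << n
        return [(j, cube(i, j)) for j in range(N) if not (j >> i) & 1]

    print(cube_cycles(2, 3))   # [(0, 4), (1, 5), (2, 6), (3, 7)]  i.e., (0 4)(1 5)(2 6)(3 7)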
Masking may change the data-routing patterns in a cube network. The general practice is to disable all PEs belonging to the same cycle of a permutation. In the above example, if both PE_2 and PE_6 become inactive by masking, the cycle (2 6) is removed and the cube-routing function C_2 performs only the partial permutation (0 4)(1 5)(3 7). However, if only one of the pair, say PE_6, is disabled in the above example, the above partial permutation will still be performed, but the data in both PE_2 and PE_6 will end up at PE_6, causing a two-to-one conflicting transfer. PE_2 will not receive any data, so the mapping will not be onto either. Masking should be carefully applied to cube networks because of the send-active and receive-inactive nature of data transfers among the PEs.
5.2.4 Barrel Shifter and Data Manipulator

Barrel shifters are also known as plus-minus-2^i (PM2I) networks. This type of network is based on the following routing functions:

    B_{+0} = R_{+1}
    B_{−0} = R_{−1}
    B_{+n/2} = R_{+r}        (5.13)
    B_{−n/2} = R_{−r}

This implies that the Illiac routing functions are a subset of the barrel-shifting functions. In addition to the adjacent (±1) and fixed-distance (±r) shiftings, the barrel-shifting functions allow either forward or backward shifting of distances that are integer powers of two, i.e., ±1, ±2, ±4, ±8, ..., ±2^i, ..., ±2^{n−1}. Instead of having just four nearest neighbors as in the Illiac mesh network, each PE in a barrel shifter is directly connected to 2n − 1 PEs. Therefore, the connectivity in a barrel shifter is increased from the Illiac network by having (2n − 5)·2^{n−1} more direct links. As demonstrated in Figures 5.8b and 5.11 for N = 16 (n = 4, r = 4), the Illiac network has 32 direct links and the same-size barrel shifter has 56 links. The two networks are identical only when the size is reduced to n = 2, or N = 4.
The barrel shifter can be implemented as either a recirculating single-stage network or as a multistage network. Figure 5.12 shows the interconnection patterns in a recirculating barrel shifter for N = 8. The barrel-shifting functions B_{+0}, B_{+1}, and B_{+2} are executed by the interconnection patterns shown. For a single-stage barrel shifter of size N = 2^n, the minimum number of recirculations B is upper-bounded by

    B ≤ (log2 N) / 2        (5.14)
[Figure 5.12 Interconnection patterns and routing functions in a recirculating barrel shifter for N = 8.]
For the example barrel shifter with N = 16, it takes at most two steps to route data from a PE to any other PE. If we take PE_0 as the source node, PE_0 can reach PE_1, PE_2, PE_4, PE_8, PE_12, PE_14, or PE_15 in one step. In two steps, PE_0 can reach PE_3, PE_5, PE_6, PE_7, PE_9, PE_10, PE_11, or PE_13. Thus, one step is saved compared with the same-size Illiac network. If one replaces the 64-node Illiac-IV network by a 64-node barrel shifter, at most three routing steps are needed (instead of seven). The speedup of a barrel shifter over the Illiac network of the same size can be expressed as

    speedup = (√N − 1) / ((log2 N)/2)        (5.15)

where N = 2^n. Therefore, the larger the network, the higher the speedup ratio. For very large networks with N = 2^{2k}, the speedup approaches 2^k/k, as demonstrated in Table 5.1.
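Equation 5.15 and the entries of Table 5.1 follow directly from the two upper bounds of Eqs. 5.8 and 5.14; a quick computational check (Python, illustrative):

    import math

    def illiac_bound(N):
        return math.isqrt(N) - 1           # Eq. 5.8: at most sqrt(N) - 1 steps

    def barrel_bound(N):
        return math.log2(N) / 2            # Eq. 5.14: at most (log2 N)/2 steps

    for k in range(2, 6):
        N = 4 ** k                          # N = 2**(2k)
        print(N, illiac_bound(N), barrel_bound(N),
              round(illiac_bound(N) / barrel_bound(N), 2))
    # 16 3 2.0 1.5
    # 64 7 3.0 2.33
    # 256 15 4.0 3.75
    # 1024 31 5.0 6.2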
A barrel shifter has been implemented with multiple stages in the form of a data manipulator. As shown in Figure 5.13, the data manipulator consists of n stages of N cells. Each cell is essentially a controlled shifter. This network is designed for implementing data-manipulating functions such as permuting, replicating, spacing, masking, and complementing. To implement a data-manipulating function, the control lines of the six groups (u_1^i, u_2^i, h_1^i, h_2^i, d_1^i, d_2^i) in each column must be properly set through the use of the control register and the associated decoder.
The schematic logic circuit of a typical cell in a data manipulator is shown in Figure 5.14. For 0 ≤ k ≤ N − 1 and 0 ≤ i ≤ n − 1, the kth cell at stage i (column 2^i) has three inputs, three sets of outputs, and three control signals. Individual stage control is used, with three sets of control signals per stage. The control lines u^i, h^i, and d^i are connected to the AND gates in each cell of stage i. The u^i line controls the backward barrel shifting (−2^i) and the d^i line controls the forward barrel shifting (+2^i). The horizontal line corresponds to no shifting, under the control of the h^i signal. Note that stage i performs the distance-2^i shiftings. By passing data through the n stages from left to right, the shifting distance decreases from 2^{n−1} to 2^{n−2} and eventually to 2^1 and 2^0 at the output end. Note that all the
Table 5.1 The speedup of a barrel shifter over an Illiac network of the same size N = 2^{2k}

    k    N       √N − 1    (log2 N)/2    Speedup
    2    16      3         2             1.50
    3    64      7         3             2.33
    4    256     15        4             3.75
    5    1024    31        5             6.20
CR: control register; IMR: input mask register; OMR: output mask register.
Figure 5.13 The data manipulator for N = 16. (Courtesy of IEEE Trans. Computers, Feng 1974.)
shifting operations at all stages are modulo N. This is reflected by the wraparound connections in the data manipulator.
In terms of permutations, the B_{+i} routing function can be expressed by the following product of 2^i cycles, each of size 2^{n−i}:

    ∏_{k=0}^{2^i − 1} (k   k + 2^i   k + 2·2^i   ···   k + N − 2^i)

For the example network of N = 8, the B_{+1} function is represented by the permutation (0 2 4 6)(1 3 5 7). Similarly, for B_{−i} we have the following permutation in cycle notation:

    ∏_{k=0}^{2^i − 1} (k + N − 2^i   ···   k + 3·2^i   k + 2·2^i   k + 2^i   k)
[Figure 5.14 The logic design of an intermediate cell in the data manipulator: the kth cell of column 2^i receives inputs from the (k − 2^{i+1})th, kth, and (k + 2^{i+1})th cells of column 2^{i+1} and delivers outputs to the (k − 2^i)th, kth, and (k + 2^i)th cells of column 2^{i−1}, under the control signals u^i(k), h^i(k), and d^i(k).]
[Examples of data-manipulating functions include the substring flip and the upward replication of a spaced substring.]
[Figure 5.15 The perfect shuffle and the inverse perfect shuffle for N = 8.]
shuffle cuts the deck into two halves from the center and intermixes them evenly. The inverse perfect shuffle does the opposite to restore the original ordering. The exchange-routing function E is defined by

    E(a_{n−1} ··· a_1 a_0) = a_{n−1} ··· a_1 ā_0        (5.19)

The complementing of the least significant digit means the exchange of data between two PEs with adjacent addresses. Note that E(A) = C_0(A), where C_0 is the cube-routing function defined in Eq. 5.9.
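Both routing functions are simple bit manipulations of the PE address; a Python sketch (illustrative; Eq. 5.18 is the perfect-shuffle definition referred to below):

    def shuffle(A, n):
        """Perfect shuffle S: cyclic left shift of the n-bit address (Eq. 5.18)."""
        msb = (A >> (n - 1)) & 1
        return ((A << 1) | msb) & ((1 << n) - 1)

    def exchange(A):
        """Exchange E: complement the least significant address bit (Eq. 5.19)."""
        return A ^ 1

    # For N = 8 (n = 3): S maps 1 -> 2 -> 4 -> 1 and 3 -> 6 -> 5 -> 3, with 0 and 7 fixed.
    print([shuffle(A, 3) for A in range(8)])    # [0, 2, 4, 6, 1, 3, 5, 7]
    print([exchange(A) for A in range(8)])      # [1, 0, 3, 2, 5, 4, 7, 6]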
These shuffle-exchange functions can be implemented as either a recirculating network or a multistage network. For N = 8, a single-stage recirculating shuffle-exchange network is shown in Figure 5.16. The solid lines indicate exchange and the dashed lines indicate shuffle. The use of a recirculating shuffle-exchange network for parallel processing was proposed by Stone. There are a number of parallel algorithms that can be effectively implemented with the use of the shuffle and exchange functions. Examples include the fast Fourier transform (FFT), polynomial evaluation, sorting, and matrix transposition.
The shuffle-exchange functions have been implemented with the multistage Omega network by Lawrie. The Omega network for N = 8 is illustrated in Figure 5.17. An N × N Omega network consists of n identical stages. Between two adjacent stages is a perfect-shuffle interconnection. Each stage has N/2 switch boxes under independent box control. Each box has four functions (straight, exchange, upper broadcast, lower broadcast), as illustrated in Figure 5.6. The switch boxes in the Omega network can be repositioned as shown in Figure 5.17b without violating the perfect-shuffle interconnections between stages.
The n-cube network shown in Figure 5.10 has the same interconnection topology as the repositioned Omega network (Figure 5.17b). The two networks differ in two aspects:
1. The cube network uses two-function switch boxes, whereas the Omega network
uses four-function switch boxes.
2. The data-flow directions in the two networks are opposite to each other. In
other words, the roles of input-output lines are exchanged in the two networks.
Based on the above differences, the n-cube and Omega networks have different
capabilities even with isomorphic topologies. Suppose we wish to establish the
[Figure 5.16 Shuffle-exchange recirculating network for N = 8. (Solid lines are exchanges and dashed lines are shuffles.)]
[Figure 5.17 The multistage Omega network for N = 8 proposed by Lawrie (1975): (a) the Omega network; (b) the Omega network with the switch boxes repositioned.]
I/O connections zero to five and one to seven. The Omega network (Figure 5.17a) can perform this task, whereas the n-cube network (Figure 5.10) cannot. On the other hand, the n-cube network can connect five to zero and seven to one, but the Omega network cannot. In general, the Omega network can perform one-to-many connections, while the n-cube network cannot. However, if one considers only bijections (one-to-one connections), the n-cube and Omega networks are functionally equivalent under some relabeling techniques.
If one applies the shuffle function S i times, written S^i, the binary sequence in Eq. 5.18 will be shifted cyclically to the left by i bit positions. The shuffle function S (Eq. 5.18) corresponds to the following permutation cycles:

    ∏ (j   S(j)   S²(j)   ···),   0 ≤ j ≤ N − 1

where, for each cycle, j has not appeared in a previous cycle. The largest cycle in the above permutation has order n. For N = 8, the shuffle function corresponds to the permutation (0)(1 2 4)(3 6 5)(7).
The exchange function E (Eq. 5.19) can be expressed as a product of N/2 cycles of order two, where the index j is even:

    ∏_{j=0, j even}^{N−2} (j   j + 1)        (5.24)
Table 5.3 Lower and upper bounds on the number of transfers for the network in row i to simulate the network in column j, where n = log2 N

    Bound    Illiac    Barrel    Shuffle-exchange    Cube
The original motivation for developing SIMD array processors was to perform
parallel computations on vector or matrix types of data. Parallel processing
algorithms have been developed by many computer scientists for SIMD computers.
Important SIMD algorithms can be used to perform matrix multiplication, fast
Fourier transform (FFT), matrix transposition, summation of vector elements,
matrix inversion, parallel sorting, linear recurrence, boolean matrix operations,
and to solve partial differential equations. We study below several representative
SIMD algorithms for matrix multiplication, parallel sorting, and parallel FFT.
We shall analyze the speedups of these parallel algorithms over the sequential
algorithms on SISD computers. The implementation of these parallel algorithms
on SIMD machines is described by concurrent ALGOL. The physical memory
allocations and program implementation depend on the specific architecture of a
given SIMD machine.
The conventional SISD algorithm for multiplying two n × n matrices A = (a_ij) and B = (b_jk), and the SIMD algorithm of Example 5.4 that replaces the innermost loop with vector operations over n PEs, are listed below:

    For i = 1 to n Do
      For j = 1 to n Do
        c_ij = 0                              (initialization)
        For k = 1 to n Do
          c_ij = c_ij + a_ik · b_kj           (scalar additive multiply)      (5.23)
        End of k loop
      End of j loop
    End of i loop

    For i = 1 to n Do
      Par for k = 1 to n Do
        c_ik = 0                              (vector load)
      For j = 1 to n Do
        Par for k = 1 to n Do
          c_ik = c_ik + a_ij · b_jk           (vector multiply)
        End of j loop
    End of i loop
[Memory allocation for the SIMD matrix multiplication: matrix A = (a_ij) and matrix B = (b_ij) remain unchanged throughout execution, while matrix C = (c_ij), with all zeros initially, changes as shown in Table 5.4.]
operations are needed in the double loops. The successive memory contents in the execution of the above SIMD matrix multiplication program are illustrated in Table 5.4. Each vector multiply instruction implies n parallel scalar multiplications in each of the n² iterations. The algorithm in Example 5.4 is implementable on an array of n PEs.
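The O(n²) behavior comes from replacing the innermost k loop with one vector operation across the n PEs. A sequential Python simulation of the Example 5.4 program (illustrative only; column k of the arrays stands in for PEM_k):

    def simd_matrix_multiply(A, B):
        """Simulate the O(n^2) SIMD algorithm: n^2 broadcast-multiply-add vector steps."""
        n = len(A)
        C = [[0] * n for _ in range(n)]          # vector load: c_ik := 0
        for i in range(n):
            for j in range(n):
                a = A[i][j]                      # scalar a_ij broadcast from the CU
                for k in range(n):               # done in parallel by the n PEs
                    C[i][k] += a * B[j][k]
        return C

    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    assert simd_matrix_multiply(A, B) == [[19, 22], [43, 50]]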
If we increase the number of PEs used in an array processor to n², an O(n log2 n) algorithm can be devised to multiply the two n × n matrices A and B. Let n = 2^m and recall the binary cube network described in Section 5.2.3. Consider an array processor whose n² = 2^{2m} PEs are located at the 2^{2m} vertices of a 2m-cube network. A 2m-cube network can be considered as two (2m − 1)-cube networks linked
[Table 5.4 The successive contents of the c registers, c_ik ← c_ik + a_ij × b_jk, over the iterations of the SIMD matrix multiplication.]
The n rows of matrix A are initially distributed over the PEs whose addresses satisfy

    p_{2m−1} p_{2m−2} ··· p_m = p_{m−1} p_{m−2} ··· p_0        (5.25)

as demonstrated in Figure 5.20a for the initial distribution of the four rows of the matrix A in a 4 × 4 matrix multiplication (n = 4, m = 2). The four rows of A are then broadcast over the fourth dimension and the front-to-back edges, as marked by row numbers in Figure 5.20b.
The n columns of matrix B (or the n rows of matrix B^t) are evenly distributed over the PEs of the 2m-cube, as illustrated in Figure 5.20c. The four rows of B^t are then broadcast over the front and back faces, as shown in Figure 5.20d. Figure 5.21 shows the combined results of the A and B^t broadcasts, with the inner products ready to be computed. The n-way broadcast depicted in Figures 5.20b and 5.20d takes log2 n steps, as illustrated in Figure 5.21 in m = log2 n = log2 4 = 2 steps. The matrix multiplication on a 2m-cube network is formally specified below.
The above algorithm takes a total of 3n log2 n + O(n) time steps to complete, which is O(n log2 n). This demonstrates a gain in speed over the O(n²) algorithm in Example 5.4, at the expense of using n² PEs instead of only n PEs in the slower algorithm. In Chapter 10, we shall further show a VLSI hardware approach to complete the n-by-n matrix multiplication in O(n) time using O(n²/m²) VLSI processor arrays, each consisting of an array of O(m²) PEs for pipelined inner-product computations.
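The n-way broadcast used in this algorithm is ordinary recursive doubling over cube dimensions: after one routing step per dimension, every PE of the addressed subcube holds a copy. A Python sketch of such a one-to-all cube broadcast (illustrative; a flat list indexed by PE address stands in for the cube):

    def cube_broadcast(values, dims):
        """One-to-all broadcast over the cube dimensions in dims.

        values maps PE address -> value (None if the PE holds nothing yet);
        each dimension accounts for one parallel routing step.
        """
        for d in dims:
            for pe, v in list(enumerate(values)):   # snapshot of holders before this step
                if v is not None:
                    values[pe ^ (1 << d)] = v
        return values

    # 8 PEs (a 3-cube); PE 0 starts with the value, all PEs hold it after 3 steps.
    held = cube_broadcast([42] + [None] * 7, dims=[0, 1, 2])
    assert held == [42] * 8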
[Figure 5.22 A mesh connection of PEs without boundary wraparound connections for SIMD sorting.]
[Figure 5.23 Sorting patterns with respect to three ways of indexing the PEs: (c) the sorted pattern with shuffled row-major indexing; (d) the sorted pattern with snakelike row-major indexing.]
No sorting algorithm can sort n² elements on such a mesh in a time of less than O(n). In other words, an O(n) sorting algorithm is considered optimal on a mesh of n² PEs. Before we show one such optimal sorting algorithm on the mesh-connected PEs, let us review Batcher's odd-even merge sort of two sorted sequences on a set of linearly connected PEs, shown in Figure 5.25. The shuffle and unshuffle operations can each be implemented with a sequence of interchange operations (marked by the double arrows in Figure 5.26). Both the perfect shuffle and its inverse (unshuffle) can be done in k − 1 interchanges, or 2(k − 1) routing steps, on a linear array of 2k PEs.
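For reference, the data movement of Batcher's odd-even merge (unshuffle, merge the two halves, shuffle, then one comparison-interchange step) can be written recursively. A sequential Python sketch of the merge itself (illustrative; this captures the arithmetic of the method, not the PE-level routing):

    def odd_even_merge(a, b):
        """Merge two sorted sequences of equal power-of-two length (Batcher)."""
        if len(a) == 1:
            return [min(a[0], b[0]), max(a[0], b[0])]
        # Unshuffle: split each input into its even- and odd-indexed elements and merge.
        even = odd_even_merge(a[0::2], b[0::2])
        odd  = odd_even_merge(a[1::2], b[1::2])
        # Shuffle the two merged halves back together ...
        merged = [x for pair in zip(even, odd) for x in pair]
        # ... and finish with one parallel comparison-interchange step.
        for i in range(1, len(merged) - 1, 2):
            if merged[i] > merged[i + 1]:
                merged[i], merged[i + 1] = merged[i + 1], merged[i]
        return merged

    assert odd_even_merge([1, 4, 6, 7], [2, 3, 5, 8]) == [1, 2, 3, 4, 5, 6, 7, 8]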
Batcher's odd-even merge sort on a linear array has been generalized by Thompson and Kung to a square array of PEs. Let M(j, k) be a sorting algorithm for merging two sorted j-by-k/2 subarrays to form a sorted j-by-k array, where j and k are powers of 2 and k > 1. The snakelike row-major ordering is assumed in all the arrays. In the degenerate case of M(1, 2), a single comparison-interchange step is sufficient to sort two unit subarrays. Given two sorted columns of length j ≥ 2, the M(j, 2) algorithm consists of the following steps (where t_R is the time of one unit-distance routing step and t_C the time of one comparison):

J1: Move all odds to the left column and all evens to the right in 2t_R time.
J2: Use the odd-even transposition sort to sort each column in 2jt_R + jt_C time.
J3: Interchange on each row in 2t_R time.
J4: Perform one comparison-interchange in 2t_R + t_C time.
[Figure 5.25 Batcher's odd-even merge of two sorted sequences on linearly connected PEs. L1: unshuffle, odd-indexed elements to the left and even-indexed elements to the right. L2: merge the subsequences of length 2. (Further shuffle and comparison-interchange steps follow.)]
The above M(j, 2) algorithm is illustrated in Figure 5.27 for an M(4, 2) sorting. When j > 2 and k > 2, the M(j, k) sorting algorithm for a mesh-connected array processor is recursively specified as follows:

M1: If j > 2, perform a single interchange step on even rows so that the columns contain either all evens or all odds. If j = 2, the columns are already segregated, so nothing needs to be done (time: 2t_R).
M2: Unshuffle each row (time: (k − 2)·t_R).
M3: Merge by calling algorithm M(j, k/2) on each half of the array (time: T(j, k/2)).
M4: Shuffle each row (time: (k − 2)·t_R).
M5: Interchange on even rows (time: 2t_R).
M6: Comparison-interchange adjacent elements (every even with the next odd) (time: 4t_R + t_C).
For j > 2 and k > 2, the M(j, k) sorting algorithm is illustrated in Figure 5.28 for the case of M(4, 4). Steps M1 and M2 unshuffle the elements. Step M3 recursively merges the odd subsequences and the even subsequences. Steps M4 and M5 shuffle the odd and even subsequences back together. M6 performs the final comparison-interchange. Two sorted 4-by-2 subarrays are being merged to form a 4-by-4 sorted array in snakelike row-major ordering. Let T(j, k) be the time required to perform all the steps in the M(j, k) sorting algorithm. In the degenerate case of k = 2, we have

    T(j, 2) = (2j + 6)t_R + (j + 1)t_C        (5.26)

Figure 5.27 Data routing, comparison, and interchange operations performed in the M(4, 2) sorting algorithm. (Courtesy of IEEE Trans. Computers, Thompson and Kung 1977.)
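Adding the step times listed for M1 through M6 gives the recurrence T(j, k) = T(j, k/2) + (2k + 4)t_R + t_C, with Eq. 5.26 as the base case. A small Python evaluator (illustrative; the recurrence is our own restatement of the listed step times):

    def T(j, k, tR=1.0, tC=1.0):
        """Time of the M(j, k) merge, from Eq. 5.26 and the step times of M1-M6."""
        if k == 2:
            return (2 * j + 6) * tR + (j + 1) * tC           # Eq. 5.26
        return T(j, k // 2, tR, tC) + (2 * k + 4) * tR + tC  # M1-M6 around the recursive call

    print(T(4, 4))   # time of the M(4, 4) merge of Figure 5.28, with unit tR and tC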
Figure 5.28 Recursive merge-sorting steps in the M(4, 4) sorting algorithm (the final sorted 4 × 4 array has snakelike row-major ordering). (Courtesy of IEEE Trans. Computers, Thompson and Kung 1977.)
[Figure 5.30 Computation of a 16-point FFT in an SIMD machine with 8 PEs, PE 0 through PE 7, each initially holding two of the sixteen sample points. (Courtesy of IEEE Proc. 5th Int'l Conf. on PRIP, Mueller et al., 1980.)]
network provides a natural means for specifying the interprocessor data transfers required by the FFT algorithm. This network consists of n routing functions C_i for 0 ≤ i < log2 N, as in Eq. 5.9. Figure 5.30 illustrates the data transfers and computations performed in the one-dimensional FFT algorithm for an SIMD machine with N = M/2 PEs.
The above algorithm performs the M-point FFT calculations using log2(M/2) parallel data transfers. This is a lower bound on the number of data transfers required to perform an M-point FFT when the M points are initially distributed over M/2 PEs. The number of parallel butterfly operations performed is log2 M, where each butterfly involves two complex additions and one complex multiplication in each PE. The number of butterfly steps is reduced from (M/2) log2 M in a serial FFT algorithm to log2 M for a parallel FFT algorithm using M/2 PEs.
Because M/2 PEs may not be available for the computation of an M-point FFT, it is of interest to consider FFT algorithms which use fewer PEs. A simple
solution for using fewer PEs is to replicate the steps of the M/2-PE algorithm. For example, if N = M/4, two computations that were performed in parallel in different PEs in the M/2-PE algorithm are now performed sequentially in the same PE, as shown for M = 16 in Figure 5.31. The number of butterfly steps performed is 2 log2 M, and 2(log2 M − 2) = 2 log2(M/4) parallel data transfers are required. This approach can be generalized to perform an M-point FFT on M/2^k PEs for 2 ≤ k ≤ log2 M. For N = M/2^k PEs, each PE will initially contain 2^k elements. The number of parallel butterfly steps performed will be 2^{k−1} log2 M. The data transfers will be performed by the C_i functions for log2 M − k − 1 ≥ i ≥ 0; each C_i function will be replicated 2^{k−1} times. The total number of parallel data transfers will therefore be 2^{k−1}(log2 M − k).
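The two counts just derived are easy to tabulate. A Python sketch (illustrative; the function name is our own):

    import math

    def fft_parallel_costs(M, k):
        """Butterfly steps and parallel data transfers for an M-point FFT on M / 2**k PEs.

        k = 1 is the M/2-PE algorithm; larger k replicates each step 2**(k-1) times.
        """
        logM = int(math.log2(M))
        butterflies = 2 ** (k - 1) * logM
        transfers   = 2 ** (k - 1) * (logM - k)
        return butterflies, transfers

    print(fft_parallel_costs(16, 1))   # (4, 3): 8 PEs, as in Figure 5.30
    print(fft_parallel_costs(16, 2))   # (8, 4): 4 PEs, as in Figure 5.31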
We consider next the two-dimensional FFT algorithm for processing an
M-by-M signal array. A standard approach to computing the two-dimensional FFT of a signal array S is to perform the one-dimensional FFT on the rows of S, giving an intermediate matrix G, and then to perform the one-dimensional FFT on the columns of G. The resulting matrix F is the two-dimensional FFT of S. An SIMD algorithm which uses N = M²/2 PEs is presented below.
The implementation of a two-dimensional FFT makes use of the previous
work done for one-dimensional FFTs. The PEs are logically partitioned into M
rows of M/2 PEs. Each row of PEs is given a row of the input matrix S, with two
matrix elements in each PE. The two-dimensional FFT is implemented by simul-
taneously having each row of PEs compute the FFT of its row of the input matrix
to obtain G. The PEs are then logically reconfigured to form M columns of M/2
processors, with each column of PEs having a column of G. Then each column of
PEs computes the FFT of its column of G to obtain F, the FFT of the input matrix.
This approach can be considered a row-column method in that it transforms the rows of the matrix S to produce G, and then transforms the columns of the intermediate matrix G to produce F.
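The row-column decomposition itself can be checked numerically. A Python/NumPy sketch (illustrative; NumPy's FFT is used only as the one-dimensional building block):

    import numpy as np

    def row_column_fft2(S):
        """Two-dimensional FFT by the row-column method described above."""
        G = np.fft.fft(S, axis=1)    # one-dimensional FFT of every row of S
        F = np.fft.fft(G, axis=0)    # one-dimensional FFT of every column of G
        return F

    rng = np.random.default_rng(0)
    S = rng.standard_normal((8, 8))
    assert np.allclose(row_column_fft2(S), np.fft.fft2(S))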
Initially, the PEs are logically configured as M rows of M/2 PEs, logically numbered (i, j), where 0 ≤ i < M and 0 ≤ j < M/2. The physical address of PE(i, j) is i(M/2) + j. The physical address can be represented in binary as

    p_{2μ−2} p_{2μ−3} ··· p_{μ−1} p_{μ−2} ··· p_1 p_0

where μ = log2 M. Bits p_{μ−2} ··· p_0 are the binary representation of j, and bits p_{2μ−2} ··· p_{μ−1} are the binary representation of i. The input matrix S is distributed such that PE(i, j) has S(i, j) and S(i, j + M/2). Thus, each row of PEs can perform the one-dimensional FFT on its row of S with N = M/2. In this case, the cube functions required for the data transfers will exchange data based on the lower-order μ − 1 bits of the physical address; i.e., the functions will act on j independently of i. Thus, the one-dimensional FFT can be performed on each row independently and simultaneously. The result G is distributed so that each column of PEs holds two columns of G, with each PE holding one element from each of the two columns of G.
The PEs are now logically reconfigured to form M columns of M/2 PEs, with each column of PEs having one column of G. Two matrix elements are in each PE.
[Figure 5.31 Computation of a 16-point FFT in an SIMD machine with 4 PEs. (Courtesy of IEEE Proc. 5th Int'l Conf. on PRIP, Mueller et al., 1980.)]
Summing up, the speedup of a parallel two-dimensional FFT algorithm over a serial FFT algorithm is M²/2. Not surprisingly, this speedup equals the number of PEs in the SIMD array processor.
Two SIMD computers, the Goodyear Aerospace STARAN and the Parallel Element Processing Ensemble (PEPE), have been built around an associative memory (AM) instead of the conventional random-access memory (RAM). The fundamental distinction between AM and RAM is that AM is content-addressable.
[Figure 5.32 An associative memory array of n words by m bits, with a comparand register C, a masking register M, indicator registers I, and temporary registers T; the cell at word i and bit slice j is denoted B_ij, for 1 ≤ i ≤ n and 1 ≤ j ≤ m.]
respectively. Each of these registers can be set, reset, or loaded from an external source with any desired binary pattern. The counters are used to keep track of the i and j index values. There are also some match-detection circuits and priority logic, which are peripheral to the memory array and are used to perform some vector boolean operations among the bit slices and indicator patterns.
The search key in the C register is first masked by the bit pattern in the M
register. This masking operation selects the effective fields of bit slices to be in-
volved. Parallel comparisons of the masked key word with all words in the associa-
tive memory are performed by sending the proper interrogating signals to all the
bit slices involved. All the involved bit slices are compared in parallel or in a
sequential order, depending on the associative memory organization. It is possible
that multiple words in the associative memory will match the search pattern.
Therefore, the associative memory may be required to tag all the matched words.
The indicator and temporary registers are mainly used for this purpose. The in-
terrogation mechanism, read and write drives, and matching logic within a typical
bit cell are depicted in Figure 5.33. The interrogating signals are associated with
each bit slice, and the read-write drives are associated with each word. There are
[Figure 5.33 The schematic logic design of a typical cell in an associative memory, showing the interrogate-1 and interrogate-0 lines, the write and read drives, the stored bit, and the bit-cell and word readouts, together with the truth table relating the interrogation signal, the mask bit, and the stored information.]
two types of comparison readouts: the bit-cell readout and the word readout. The two types of readout are needed in two different associative memory organizations.
In practice, most associative memories have the capability of word-parallel operations; that is, all words in the associative memory array are involved in the parallel search operations. This differs drastically from the word-serial operations encountered in RAMs. Based on how the bit slices are involved in the operation, we consider below two different associative memory organizations:

The bit parallel organization In a bit parallel organization, the comparison process is performed in a parallel-by-word and parallel-by-bit fashion. All bit slices which are not masked off by the masking pattern are involved in the comparison process. In this organization, word-match tags for all words are used (Figure 5.34a). Each cross point in the array is a bit cell. Essentially, the entire array of cells is involved in a search operation.

Bit serial organization The memory organization in Figure 5.34b operates with one bit slice at a time across all the words. The particular bit slice is selected by an extra logic and control unit. The bit-cell readouts will be used in subsequent bit-slice operations. The associative processor STARAN has the bit serial memory organization, and the PEPE has been installed with the bit parallel organization.
The associative memories are used mainly for the search and retrieval of non-numeric information. The bit serial organization requires less hardware but is slower in speed. The bit parallel organization requires additional word-match detection logic but is faster in speed. We present below an example to illustrate the search operation in a bit parallel associative memory. Bit serial associative memory will be presented in Section 5.4.3 with various associative search and retrieval algorithms.
Example 5.8 Consider a student-file search in a bit parallel associative memory, as illustrated in Figure 5.35. The query needs to find all students whose age is not younger than 21 but is younger than 31. This requires performing a not-less-than search and a less-than search on the age field of the file. Two matching patterns are used in the two subsequent searches. The masking pattern selects the age field. The lower bound 21 is loaded into the C register as the first key word. Parallel comparisons are performed on all student records (words) in the file. Initially, the indicator register is cleared to zero.
After the first search, those students who are not younger than 21 are marked with a 1 in the indicator register, one bit per student word. This matching vector is then transferred to one of the T registers. Then the upper bound 31 is loaded into C as the second matching key. After the second search, a new matching vector is sent to the I register. A bitwise ANDing operation is then performed between the I and T registers, with the resulting vector residing in the I register as the final output of the search process. The whole search process requires only two accesses of the associative memory. An output circuit (shown in Figure 5.34) is used to control the reading out of the result.
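A register-level paraphrase of this two-pass search (Python, illustrative; boolean lists stand in for the indicator and temporary registers, and the sample ages follow Figure 5.35):

    def range_search(ages, low=21, high=31):
        """Two threshold searches over the age field, combined by a bitwise AND.

        First pass: not-less-than low  -> match vector moved to T.
        Second pass: less-than high    -> match vector left in I.
        Result: I AND T marks the students with low <= age < high.
        """
        T = [age >= low  for age in ages]   # first search, key word = 21
        I = [age <  high for age in ages]   # second search, key word = 31
        return [i and t for i, t in zip(I, T)]

    ages = [25, 19, 28, 33, 21, 31, 20]     # Ford, Nixon, Smith, Jones, ..., Brown, Peterson
    print(range_search(ages))               # [True, False, True, False, True, False, False]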
[Figure 5.34 Two associative memory organizations: (a) the bit parallel organization, with a word-match tag network attached to each word; (b) the bit serial organization, with word logic for each word and an output circuit feeding the ALU.]
[Figure 5.35 A student-file search in a bit parallel associative memory. Query: search for those students whose ages are in the range (21, 31). Each record contains Name, Sex, Dept., Age, and Class fields (Ford, Nixon, Smith, Jones, ..., Brown, Peterson); the T and I columns hold the match indicators after the two searches.]
The PEPE architecture There are two types of fully parallel associative processors: word-organized and distributed-logic. In a word-organized associative processor, the comparison logic is associated with each bit cell of every word and the logical decision is available at the output of every word. In a distributed-logic associative processor, the comparison logic is associated with each character cell of a fixed number of bits or with a group of character cells. The best-known example of a distributed-logic associative processor is the PEPE. Because of the requirement of additional logic per cell, a fully parallel associative processor may be cost prohibitive. A distributed-logic associative processor is less complex and thus less expensive. The PEPE is based on a distributed-logic configuration developed at Bell Laboratories for radar signal-processing applications.
A schematic block diagram of PEPE is given in Figure 5.36. PEPE is composed of the following functional subsystems: an output data control, an element memory control, an arithmetic control unit, a correlation control unit, an associative output control unit, a control system, and a number of processing elements. Each processing element (PE) consists of an arithmetic unit, a correlation unit, an associative output unit, and a 1024 × 32-bit element memory. There are 288 PEs organized into eight element bays. Selected portions of the workload are loaded from a host computer, a CDC 7600, to the PEs. The loading selection process is determined by the inherent parallelism of the task and by the ability of PEPE's unique architecture to manipulate the task more efficiently than the host computer. Each processing element is delegated the responsibility for an object under observation by the radar system, and each processing element maintains a data file for specific objects within its memory and uses its associative arithmetic capability to continually update its respective file.
PEPE represents a typical special-purpose computer. It was designed to perform real-time radar tracking in the antiballistic missile (ABM) environment. No commercial model was made available. It is an attached array processor to the general-purpose machine CDC 7600, as demonstrated in Figure 5.36.
The bit-serial STARAN organization The fully parallel structure requires expensive logic in each memory cell and complicated communications among the cells. The bit serial associative processor is much less expensive than the fully parallel structure because only one bit slice is involved in the parallel comparison at a time. A bit serial associative memory (Figure 5.34b) is used. This bit serial associative processing has been realized in the computer STARAN (Figure 5.37). STARAN consists of up to 32 associative array modules. Each associative array module contains a 256-word by 256-bit multidimensional access (MDA) memory, 256 processing elements, a flip network, and a selector, as shown in Figure 5.38a. Each processing element operates serially, bit by bit, on the data in all MDA memory words. The operational concept of a STARAN associative array module is illustrated in Figure 5.38b.
Using the flip network, the data stored in the MDA memory can be accessed through the I/O channels in bit slices, word slices, or a combination of the two. The flip network is used for data shifting or manipulation to enable parallel
[Figure 5.36 The Parallel Element Processing Ensemble (PEPE): the CDC 7600 host and the radar interface computer feed the output data control and element memory control; the arithmetic control unit, associative output control unit, and correlation control unit drive the processing elements; the control system contains the control console, sequential control logic, the parallel instruction queue, and the parallel instruction control unit. (Courtesy of Proc. of National Electronics Conference, Berg et al., 1972.)]
[Figure 5.37 The STARAN system architecture: associative array modules connected through PIO and DMA ports to external computers, peripherals, displays, and sensors, with external function logic, associative control logic, sequential control logic, memory port logic, and the associative processor control memory forming the control system. (Courtesy of Proc. of National Computer Conference, AFIPS Press, Batcher 1974.)]
search, arithmetic, or logical operations among the words of the MDA memory. The MDA memory was implemented by Goodyear Aerospace using random-access memory chips with additional XOR logic circuits. The first STARAN was installed for digital image processing in 1975. Since then, Goodyear Aerospace has announced some enhanced STARAN models. The size of the MDA memory has increased to 9216 × 256 per module in the enhanced model, with improved I/O and processing speed.
To locate a particular data item, STARAN initiates a search by calling for a match specified by the associative control logic. In one instruction execution, the data in all the selected memories of all the modules are processed simultaneously by the simple processing element at each word. The interface unit shown in Figure 5.37 provides interfaces with sensors, conventional computers, signal processors, interactive displays, and mass-storage devices. A variety of I/O options are
[Figure 5.38 A STARAN associative array module: (a) the 256 × 256 multidimensional access (MDA) memory, the selector, the flip (permutation) network, and the ALU of 256 processing elements, with input, output, and control connections to the control unit; (b) bit-slice and word-slice access of the 256-word by 256-bit MDA memory by the 256 PEs.]
Associative memories are mainly used for the fast search and ordered retrieval of
large files of records. Many researchers have suggested using associative memories
for implementing relational database machines. Each relation of records can be
arranged in a tabular form, as illustrated in Example 5.8. The tabulation of records
(relations) can be programmed into the cells of an associative memory. Various
associative search operations have been classified into the following categories by
T. Y. Feng (1976).
Extreme searches
Equivalence searches
Proximate-to: Search for those records that satisfy a certain proximity (neighborhood) condition.

Threshold searches
Smaller-than: Search for those records that are strictly smaller than the given key.
Greater-than: Search for those records that are strictly greater than the given key.
Not-smaller-than: Search for those records that are equal to or greater than the given key.
Not-greater-than: Search for those records that are equal to or smaller than the given key.

Adjacency searches
Nearest below: Search for the nearest record which is smaller than the key.

Between-limits searches
[X, Y]: Search for those records within the closed range {z | X ≤ z ≤ Y}.
(X, Y): Search for those records within the open range {z | X < z < Y}.
[X, Y): Search for those records within the range {z | X ≤ z < Y}.
(X, Y]: Search for those records within the range {z | X < z ≤ Y}.
Ordered retrievals

Listed above are primitive search operations. Of course, one can always combine a sequence of primitive search operations with boolean operators to form various query conjunctions. For example, one may wish to answer the queries equal-to-A but not-equal-to-B; the second largest from below; outside the range [X, Y]; etc. The boolean operators AND, OR, and NOT can be used to form any query conjunction of predicates. A predicate consists of one of the above relational operators plus an attribute, such as the pairs (<, A) or (≠, A). The above search operators are frequently used in text retrieval operations.
Two examples are given below to show how to perform associative search operations. We consider the use of a bit serial associative memory (Figure 5.34b) in which all the memory words can be accessed (read) in parallel, but where the bit slices of all words, or within a specified field of all words, must be processed sequentially, one slice after another from left to right in the AM array. The following notations are used to designate any specific field of a word:
Example 5.9: The MINIMA search This algorithm searches for the smallest number among a set of n positive numbers stored in a bit serial AM array. Each number has f bits, stored in a field of a word from bit position s to bit position s + f − 1:

1. Initialize: C ← 1; I(0) ← 1; T(0) ← 0; k = 1; j = s + k − 1; M ← (0···0 1···1 0···0), with f bits of 1s selecting the field.
2. If C_j = 0, then load T(k) with T(k − 1) ∧ (C_j ⊕ B_ij); else modify I by applying I(k) = T(k) ∧ (C_j ⊕ B_ij), for all i = 1, 2, ..., n.
3. Increment k ← k + 1. Then test whether k = f. If no, proceed to step 2. If yes, read out the qualified numbers indicated by I_i(f) = 1.
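The register-level steps above follow the book's bit-serial formulation; one common way to realize the same MINIMA search is sketched below in Python (illustrative only). Candidates are eliminated slice by slice from the most significant bit of the field: whenever at least one remaining candidate has a 0 in the current bit slice, every candidate holding a 1 there is dropped.

    def bit_serial_minima(words, f):
        """Indicator vector marking the smallest of n unsigned f-bit numbers."""
        I = [True] * len(words)                       # all words start as candidates
        for k in range(f - 1, -1, -1):                # bit slice k, MSB to LSB
            slice_bits = [(w >> k) & 1 for w in words]
            if any(i and b == 0 for i, b in zip(I, slice_bits)):
                I = [i and b == 0 for i, b in zip(I, slice_bits)]
        return I

    words = [12, 7, 9, 7, 15]
    print(bit_serial_minima(words, f=4))   # [False, True, False, True, False]: both 7s marked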
In steps 2 and 4 of Example 5.9 and in step 2 of Example 5.10, all n words are involved in the specified operations. The bit-slice operations are governed by the increment of the index k in the loop. When a bit parallel associative memory is used,
the above algorithms have to be modified. Bit parallel associative memories have been used mainly in nonnumeric text retrieval operations, while bit serial associative memories have mostly been used to perform associative numeric computations.
Problems
5.1 Explain the following terminologies associated with SIMD computers:
(a) Lock-step operations            (f) Barrel-shifting functions
(b) Masking of processing elements  (g) Shuffle-exchange functions
(c) Routing functions for the Illiac network  (h) Associative memory
(d) Recirculating networks          (i) Bit serial associative processor
(e) Cube-routing functions          (j) Adjacency search
5.2 You are asked to design a data-routing network for an SIMD array processor with 256 PEs. Barrel cyclic shifters are used so that a route from one PE to another requires only one unit of time per integer-power-of-two shift in either direction.
(a) Draw the interconnection barrel-shifting network, showing all directly wired connections among the 256 PEs. In the drawing, at least one node (PE) must show all its connections to other PEs.
(b) Calculate the minimum number of routing steps from any PE_i to any other PE_{i+k} for an arbitrary distance 1 ≤ k ≤ 255. Indicate also the upper bound on the minimum routing steps required.
5.3 Consider the 64 PEs (PE_0 to PE_63) in the Illiac-IV. Determine the minimum number of data-routing steps needed to perform the following inter-PE data transfers: PE_i to PE_{(i+k) mod 64}, where 0 ≤ i ≤ 63 and 0 ≤ k ≤ 63.
5.4 Consider the use of a four-PE array processor to multiply two 3 × 3 matrices. The interconnection structure of the four PEs is shown in Figure 5.39. Wraparound connections appear in all rows and columns of the array. You need to map the matrix elements initially onto the four processors. All the three multiplications needed for each output element c_ij must be performed in the same PE in order to accumulate the sum of products. Of course, you are allowed to shift the matrix elements around if needed.
(a) Show the initial mapping of the A and B matrix elements to the processors before the first multiplication is carried out. (You may have to wrap around the matrix.)
(b) What are the initial multiplications to be carried out in each processor (there may be more than one multiplication in each processor) without any data shifting?
(c) Parallel shifts are carried out in the horizontal and vertical directions. Show the mapping of the A and B matrix elements to the processors before the second group of multiplications can be carried out.
(d) What are the multiplications to be carried out in each processor without any further shifting? (Don't bother with summing with the previous terms. Summation operations in dot-product operations are embedded in the multiply hardware automatically.)
Figure 5.39 The multiplication of two 3 × 3 matrices on a mesh of 4 PEs in Problem 5.4.
(e) Suppose the two matrices have already been allocated as in part (a). Assume each processor can perform one multiplication per unit time or one shift in a single direction per unit time for each number. (If you shift two numbers, it takes two units of time.) Determine the time units needed to complete parts (b), (c), and (d), respectively. Minimizing the total time delay is the design goal.
5.5 Let A be a 2^m × 2^m matrix stored in row-major order in the main memory. Prove that the transposed matrix A^T can be obtained by performing m perfect shuffles on A.
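As an informal check of this claim (not a proof), the short Python experiment below applies m perfect shuffles, modeled as interleaving the two halves of the linear array, to a small 2^m × 2^m matrix and verifies that the result is the transpose.

def perfect_shuffle(a):
    # interleave the first half of the list with the second half
    h = len(a) // 2
    out = []
    for i in range(h):
        out.extend([a[i], a[h + i]])
    return out

m = 3
n = 2 ** m
A = [[(r, c) for c in range(n)] for r in range(n)]        # element (r, c)
flat = [A[r][c] for r in range(n) for c in range(n)]       # row-major order
for _ in range(m):
    flat = perfect_shuffle(flat)
B = [flat[r * n:(r + 1) * n] for r in range(n)]
assert all(B[r][c] == (c, r) for r in range(n) for c in range(n))  # B = A^T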
5.6 Given an n × n matrix A = (a_ij), we want to find the n column sums

    S_j = Σ (i = 0 to n − 1) a_ij,    for j = 0, 1, …, n − 1

using an SIMD machine with n PEs. The matrix is stored in a skewed format, as shown in Figure 5.40. The jth column sum S_j is stored in PEM_j at the end of the computation. Use the machine organization shown in Figures 5.1 and 5.2. Write an SIMD algorithm and indicate the successive memory contents in the execution of the algorithm.
5.7 Consider the use of the associative memory array in Figure 5.32 for implementing a Not-Equal-To search. Assume bit-slice parallel-word operations similar to those described in Examples 5.9 and 5.10. Write out the detailed steps. The initial conditions in the registers and the intermediate and final indicator patterns must be interpreted.
5.8 (a) The Benes binary network is a type of multistage network which is rearrangeable and nonblocking, because it can perform all possible connections between inputs and outputs by rearranging its existing connections so that a new path for a new input-output pair can always be established. Develop a routing algorithm to realize the following permutation in an 8 × 8 Benes network:

    P = ( 0 1 2 3 4 5 6 7
          3 7 4 0 2 6 1 5 )

The control setting of the input and output switching elements is shown in Figure 5.41 from the first iteration of the algorithm. The algorithm to be developed is recursive in nature. It can be applied recursively to the two middle subnetworks, labelled a and b in the figure.
Figure 5.40 Memory allocation for the matrix computations in Problem 5.6.
Figure 5.41 An 8 × 8 Benes network for Problem 5.8, showing subnetworks a and b.
(b) Classify those multistage SIMD interconnection networks you have studied according to the three distinct features: blocking, rearrangeable, and nonblocking.
5.9 Consider an N-input Omega network where each switch cell is individually controlled and N = 2^n.
(a) How many different permutation functions (one-to-one and onto mappings) can be defined over N inputs?
(b) How many different permutation functions can be performed by the Omega network in one pass? If N = 8, what is the percentage of permutation functions that can be performed in one pass?
(c) Given any source-destination (S → D) pair, the routing path can be uniquely controlled by the destination address. Instead of using the destination address (D) as the routing tag, we define T = S ⊕ D as the routing tag. Show that T alone can be used to determine the routing path. What is the advantage of using T as the routing tag?
(d) The Omega network is capable of performing broadcasting (one source and multiple destinations). If the number of destination PEs is a power of two, can you give a simple routing algorithm to achieve this capability?
5.10 How many steps are required to broadcast an information item from one PE to all other PEs in each of the following single-stage interconnection networks? (N = 2^n PEs.)
(a) A shuffle-exchange network. Each step could be either a shuffle or an exchange, but not mixed.
(b) A cube network. The C_i routing is performed at each step i, 0 ≤ i ≤ n − 1.
5.11 Prove or disprove that the Omega network can perform any shift permutation in one pass. The shift permutation is defined as follows: given N = 2^n inputs, a shift permutation is either a circular left shift or a circular right shift of k positions, where 0 < k < N.
5.12 A polynomial p(x) = Σ a_i x^i can be evaluated in 2 log₂ N steps in an SIMD computer with N PEs, where N is a power of 2. Assume that each PE can perform either an add or a multiply under masking control. A shuffle interconnection exists among the N PEs. Each PE has a data register and
Table 5.6 The shuffle-multiply sequence for polynomial evaluation (N = 8)

Initial contents   Mask   After step 1   Mask   After step 2   Mask   After step 3
a0                 0      a0             0      a0             0      a0
a1                 1      a1·x           0      a1·x           0      a1·x
a2                 0      a2             1      a2·x²          0      a2·x²
a3                 1      a3·x           1      a3·x³          0      a3·x³
a4                 0      a4             0      a4             1      a4·x⁴
a5                 1      a5·x           0      a5·x           1      a5·x⁵
a6                 0      a6             1      a6·x²          1      a6·x⁶
x                  1      x²             1      x⁴             1      x⁸
a mask flag. The machine is equipped with both broadcasting and masking capabilities (instructions). To evaluate the polynomial requires first generating all the product terms a_i x^i, for i = 0, 1, …, N − 1, by a sequence of log₂ N shuffle-multiply operations as listed in Table 5.6, and then generating the sum of all product terms by a sequence of log₂ N shuffle-add operations.
(a) Show the major components and interconnection structure of the desired SIMD machine for the size N = 8.
(b) Figure out the exact sequence of SIMD machine instructions needed to carry out the shuffle-multiply sequence in Table 5.6. The shuffle instruction is used to generate the successive mask vectors. The PEs operate by broadcasting the successive multipliers x, x², and x⁴, retrieved from the eighth data register.
(c) Before entering the shuffle-add sequence, the eighth data register should be reset to zero. Repeat question (b) for the summing sequence. At the end, the final sum can be retrieved from any one of the eight PE registers.
(d) Explain the advantages of using the shuffle interconnection network for the implementation of the polynomial evaluation algorithm, as compared with the use of the Illiac mesh network for the same purpose.
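The sketch below is a functional Python model of this 2 log₂ N-step scheme under the stated assumptions (the last PE initially holds x and is zeroed before summing). The masking follows the reconstructed Table 5.6; the summation is modeled with a log₂ N-step pairwise exchange rather than literal shuffle-adds, so it illustrates the operation count, not the exact instruction sequence asked for in the problem.

import math

def evaluate_poly(coeffs, x):
    # Functional model of the 2*log2(N)-step SIMD polynomial evaluation.
    # coeffs has length N (a power of two); the last PE holds x, so only
    # the first N-1 coefficients contribute (as in Table 5.6).
    N = len(coeffs)
    steps = int(math.log2(N))
    reg = coeffs[:-1] + [x]          # PE N-1 holds the multiplier x
    mult = x
    # shuffle-multiply: after all steps, PE i holds a_i * x**i
    for s in range(steps):
        reg = [reg[i] * mult if (i >> s) & 1 else reg[i] for i in range(N)]
        mult = mult * mult           # broadcast multipliers x, x^2, x^4, ...
    reg[N - 1] = 0                   # reset the eighth register before summing
    # log2(N) pairwise-exchange add steps; every PE ends with the full sum
    for s in range(steps):
        reg = [reg[i] + reg[i ^ (1 << s)] for i in range(N)]
    return reg[0]

a = [2, -1, 3, 0, 5, 1, 4, 99]       # a7 is ignored (its PE holds x)
x = 2
assert evaluate_poly(a, x) == sum(a[i] * x**i for i in range(7))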
CHAPTER
SIX
SIMD COMPUTERS AND PERFORMANCE
ENHANCEMENT
We divide the space of SIMD computers into five subspaces, based on word-slice
and bit-slice processing and the number of control units used:
¹ Abbreviations: ws (word slice), bs (bit slice), as (associative), array (array processor).
² The original Illiac design was an MSIMD machine with 4 CUs and 256 PEs.
Figure 6.1 Unger's spatial computer concept. (Courtesy of Proc. of the IRE, October 1958.)
are given. At present, most associative processors are designed to perform fast
information retrieval and database operations. The RELACS is an associative
database machine proposed by researchers at Syracuse University in 1979. It is
based on using staged memory between the disks and the host processor. Associa-
tive memories are used to implement relational database operations. We have
already studied the PEPE and the STARAN architecture in Section 5.4.2. The
STARAN is the only commercial associative processor that has several installa-
tions in operation at present.
Figure 6.2 The vector arithmetic multiprocessor (VAMP) proposed by Senzig and Smith. (Courtesy of AFIPS Proc. FJCC, 1965.)
Figure 6.3 The family tree of SIMD array processors (numbers in brackets are estimated peak performance in megaflops).
Presented below are the system architecture, hardware and software features, and application requirements of the Illiac-IV and the BSP computer systems. The Illiac-IV system was developed at the University of Illinois in the 1960s. The system was fabricated by the Burroughs Corporation in 1972. The original objective was to develop a highly parallel computer with a large number of arithmetic units to perform vector or matrix computations at the rate of 10⁹ operations per second. In order to achieve this rate, the system was to employ 256 PEs under the supervision of four CUs. Due to cost escalation and schedule delays, the system was ultimately limited to one quadrant with 64 PEs under the control of one CU. The speed of the 64-PE quadrant is approximately 200 million operations per second. The Illiac-IV computer has been applied in numerical weather forecasting and in nuclear engineering research, among many other scientific applications.
Figure 6.4 A 64-PE Illiac IV array. (Courtesy of IEEE Proc., Bouknight et al., April 1972.)
The common data bus is used to broadcast information from the CU to the entire array of 64 PEs. For example, a constant multiplier need not be stored 64 times, once in each PEM; instead, the constant can be stored in a CU register and then broadcast to each enabled PE. Special routing instructions are used to send information from one PE register to another PE register via the routing network. Standard load or store instructions are used to transfer information between PE registers and the PEMs. At most, seven routing steps are needed to transfer information among the PEs via the mesh network. The software figures out the shortest routing path in each data-routing operation. The mode-bit line consists of one line coming from the mode register of each PE in the array. These lines can transmit the mode bits of the D registers in the array to the accumulator register in the CU. There are CU instructions which can test the mask vector and branch on a zero or nonzero condition.
The Illiac-IV communicates with the outside world through an I/O subsystem (Figure 6.5), a disk file system, and a B6500 host computer which supervises a large laser memory (10¹² bits) and the ARPA network link. The disk has 128 heads, one per track, with a 40-ms rotation time and an effective transfer rate of 10⁹ bits per second. The B6500 manages all programmer requests for system resources. The operating system, including compilers, assemblers, and I/O service routines, resides in the B6500. As a total system, the Illiac-IV array is really a special-purpose back-end machine of the B6500. The ARPA net linkage makes the Illiac-IV available to all members of the ARPA network.
The control unit (CU) of the Illiac-IV array performs the following functions
needed for the execution of programs:
Figure 6.5 The Illiac IV I/O system.
Figure 6.6 Functional block diagram of the Illiac IV control unit. (Courtesy of IEEE Proc., Bouknight et al., April 1972.)
addition, subtraction, shift, and logic operations. More complex arithmetic and vector operations are relegated to the PEs. Address arithmetic is performed by the 24-bit address adder. The final queue is used to stack the addresses and data waiting to be transmitted to the PEs.
All instructions are 32 bits wide and classified as either CU instructions or PE instructions. CU instructions are for program control (indexing, jumps, etc.) and scalar operations. PE instructions are decoded by the advanced instruction station (ADVAST) and then transmitted via control signals to all PEs. In fact, the ADVAST decodes all instructions and executes the CU instructions. The ADVAST constructs the necessary address and data operands after decoding a PE instruction. The PLA instruction buffer can hold 128 instructions, sufficient to hold the inner loop of many programs.
A block diagram of the processing element is shown in Figure 6.7. The PE computes with the distributed data and performs local indexing for skewed memory fetches. Major components in a PE include:
Each PE has a 64-bit-wide routing path to four neighbors. To minimize the physical routing distance, the PEs are grouped as shown in Figure 6.7. This drawing has been logically described in Figure 6.8 for a smaller network size. Routing by a distance of plus or minus eight occurs interior to each group of eight PEs. The CU data and instruction fetches require blocks of eight words, which are accessed in parallel. The individual PEM is a thin-film memory with a cycle time of 240 ns and an access time of 120 ns. Each has a capacity of 2048 words. Each PEM is independently accessible by its attached PE, the CU, or other I/O connections. The computing speed and memory of the Illiac-IV array require a substantial secondary storage for program and data files. A backup memory is used for programs with data sets exceeding the fast-memory capacity.
Figure 6.7 The internal structure of the processing element in Illiac IV. (Courtesy of IEEE Proc., Bouknight et al., April 1972.)
Figure 6.8 The routing connections drawn for a smaller network size.
These two Fortran statements will be compiled into a sequence of machine instructions which include the initialization of the loop, the looping-control instructions, the component-addition instruction, and the storage of the result. The initialization instructions are outside the loop. All the remaining machine instructions must be executed N times in the loop.
The Illiac-IV can perform the additions in the loop simultaneously by involving all 64 PEs in synchronous lock-step fashion. The data must be allocated in the PEMs appropriately, as the following example shows.
Example 6.1
Case 1: N = 64 (The array matches the problem size) Only three Illiac-IV machine instructions are needed to implement the Eq. 6.1 loop. The 64 components of the A, B, and C arrays are allocated in memory locations α, α + 1, and α + 2 of the PEMs, respectively, as shown in Figure 6.9. The machine instructions are:
Note that all the 64 loads in LDA, the 64 adds in ADRN, and the 64 stores in STA are performed in parallel in only three machine instructions. This means a speedup of 64 over a conventional serial computer.
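The following Python sketch is a toy model, not Illiac-IV code, of how the three broadcast instructions act on all 64 PEs in lock step. The data layout follows Figure 6.9 (A at location α, B at α + 1, C at α + 2); the variable names and test data are invented.

N, alpha = 64, 0
B = list(range(100, 100 + N))
C = list(range(200, 200 + N))
pems = [{alpha + 1: B[i], alpha + 2: C[i]} for i in range(N)]   # one PEM per PE
acc = [0] * N                       # one accumulator (RGA) per PE

acc = [pems[i][alpha + 1] for i in range(N)]             # LDA: load all B(I)
acc = [acc[i] + pems[i][alpha + 2] for i in range(N)]    # ADRN: add all C(I)
for i in range(N):                                       # STA: store all A(I)
    pems[i][alpha] = acc[i]

assert all(pems[i][alpha] == B[i] + C[i] for i in range(N))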
Case 2: N < 64 (The problem size is smaller than the array size) In this case, only a subset of the 64 PEs will be involved in the parallel operations. The same memory allocation and machine instructions as in case 1 are needed, except that some of the memory space and PEs will be masked off. The smaller the value of
Figure 6.9 Data allocation in PEMs to execute the program: DO 10 I = 1, 64; 10 A(I) = B(I) + C(I).
N compared to 64, the more severe the idleness of the disabled PEs and PEMs in the array.
Case 3: N > 64 (The problem size is greater than the array size) The memory allocation problem becomes much more complicated in this case. The case of N = 66 is illustrated in Figure 6.10. The first 64 elements of the A, B, and C arrays are stored at locations α, α + 2, and α + 4, respectively, in all PEMs. The two residual elements A(65), A(66); B(65), B(66); and C(65), C(66) are stored in locations α + 1, α + 3, and α + 5, respectively, in PEM₀ and PEM₁. The unused memory locations are indicated by question marks. Six machine-language instructions are needed to perform the 66 load, add, and store operations:
1. Load the accumulator from location α + 4.
2. Add to the accumulator the contents of location α + 2.
3. Store the result to location α.
4. Load the accumulator from location α + 5.
5. Add to the accumulator the contents of location α + 3.
6. Store the result to location α + 1.
Figure 6.10 Data allocation in PEMs to execute the program: DO 10 I = 1, 66; 10 A(I) = B(I) + C(I).
The two residual data items in the A, B, and C arrays require three additional Illiac instructions. In fact, the above six instructions can be used to perform any vector addition of dimension 65 ≤ N ≤ 128 on the Illiac-IV. The particular storage scheme shown in Figure 6.10 wastes almost three rows of storage (62 × 3 = 186 words).

DO 100 I=2,64
100 A(I)=B(I)+A(I-1)                                      (6.2)
This recursive loop demands the following set of Fortran statements to be executed sequentially:

A(2)=B(2)+A(1)
A(3)=B(3)+A(2)
...
A(63)=B(63)+A(62)
A(64)=B(64)+A(63)

By successive substitution, the recurrence can be decoupled into the following set of independent statements:

A(2)=B(2)+A(1)
A(3)=B(3)+B(2)+A(1)
A(4)=B(4)+B(3)+B(2)+A(1)
...
A(N)=B(N)+B(N-1)+...+B(2)+A(1)

These statements compute the running sums produced by the equivalent Fortran program:

S=A(1)
DO 100 K=2,64
S=S+B(K)
100 A(K)=S
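The decoupled program is a running-sum (first-order recurrence) computation, which an array machine can evaluate with recursive doubling in log₂ 64 = 6 route-and-add steps. The Python sketch below models that idea; it is an illustration of the technique, not the Illiac-IV instruction sequence discussed next, and its names and test values are invented.

import math

def prefix_sums(values):
    # Recursive doubling: after step s, element i holds the sum of the
    # 2**(s+1) values ending at position i (clipped at the left edge).
    n = len(values)
    x = list(values)
    for s in range(int(math.ceil(math.log2(n)))):
        shift = 1 << s
        # every PE adds the value routed from the PE 'shift' positions to its left
        x = [x[i] + (x[i - shift] if i >= shift else 0) for i in range(n)]
    return x

# A(I) = B(I) + A(I-1) for I = 2..64 is A(K) = A(1) + B(2) + ... + B(K)
A1, B = 5, [0, 0] + list(range(2, 65))       # B(2)..B(64), 1-indexed style
S = prefix_sums([A1] + B[2:])                # S[k-1] = A(1)+B(2)+...+B(k)
A = {1: A1}
for k in range(2, 65):
    A[k] = S[k - 1]
assert A[64] == A1 + sum(B[2:65])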
The implementation of this decoupled Fortran program on the Illiac-IV requires the following machine instructions, based on the memory allocation shown in Figure 6.11.
Figure 6.11 Status of data in PEM, RGA, RGR, and the mode register (RGD) while executing the program DO 100 I = 2, 64; 100 A(I) = B(I) + A(I−1).
Figure 6.11 shows the status of data in the PEM, the accumulator, the routing register, and the mode-status registers resulting from the enabled and disabled instructions after step 8 is executed, when I = 2. Parallelism has been revealed after decoupling in this example.
Most of the Illiac-IV system software is executed by the host processor B6500. This system manager performs the standard B6500 operations, handles users seeking Illiac-IV services, and implements the necessary features to support the operation of the PE array, the disk-file system, and the I/O subsystem. The Illiac-IV array can be visualized as the highest-priority time-sharing user of the B6500 among many users connected via the ARPA net. Results produced by the B6500 or by the Illiac-IV programs may be printed locally or transmitted over the ARPA net to remote output devices local to the user. The Illiac-IV array has a small resident operating system executed by its CU which allows the fast processing of traps and other special loading operations.
The Illiac-IV operating system runs in either a diagnostic mode or a normal mode. The main task of the diagnostic mode is the testing and diagnosis of possible faults in the I/O subsystem and in the Illiac-IV array itself. The Illiac-IV operating system consists of a set of asynchronous processes which run under the control of the B6500 master-control program. The following events may take place when a user submits an Illiac-IV job to the B6500:
1. The B6500 translates Algol or Fortran programs into binary input files to be used by the Illiac-IV array processor.
2. The Illiac-IV programs written in Ask, Glypnir, or Illiac-IV Fortran will operate on the files prepared by the B6500 programs and prepare binary output files.
3. The B6500 transforms the binary files from the Illiac-IV to the required external form for use or storage.
4. An Illiac control-language program controls the operating system for the job which it defines.
The B6500 programs and the Hliac-IV programs communicate via the disk
files (for data) and via the 48-bit path for CU interrupt signals. The protocol for
these signals over the 48-bit path is administered by two modules. The first is a small executive program residing in the Illiac-IV itself (called OS4) which processes all interrupts for the array, handles all communications between the user programs and the rest of the operating system, and provides a few standard functions for use in the array. The OS4 communicates with a module (known as the job partner) in the B6500, which acts as a clearing house for all communication between the OS4 and thus the user program running on the Illiac-IV. The job partner thus initiates all data transfers between the B6500 and the Illiac-IV disk. This arrangement emphasizes the B6500 as an I/O processor for the Illiac-IV or, conversely, the Illiac-IV as a peripheral processor for the B6500.
The Illiac-IV is very difficult to program properly if one does not banish nearly all serial-machine preconceptions and habits. It is worth pointing out the differences between the Illiac-IV high-level languages and the existing languages:
1. The natural method of addressing PEMs is by rows of 64 words, since the words of the PEMs may be addressed in parallel. However, a column of words in one PEM may not be addressed at once.
2. The vector elements are operated upon based on the mode pattern. The Illiac-IV language should allow efficient manipulation of the mode patterns.
3. The Illiac language should allow reasonable expression of routing and indexing independently in each PE.
The design experiences of the Illiac-IV are very useful in developing later SIMD array processors. The performance of the Illiac-IV is about two to four times faster than that of the CDC-7600. The Illiac-IV has limited scalar capability; it uses a recirculating mesh network with a fixed size. Some of these difficulties have been overcome in later array processors like the BSP and the MPP. We shall discuss some performance enhancement methods, including the skewed-memory allocations and some language extensions, in Section 6.4.
Figure 6.12 The BSP system organization: a system manager (central processor and input-output processor) connected to peripherals and networks; a charge-coupled-device (CCD) file memory of 4 to 64M words for data and code files, transferring at 1.5M bytes/second; and the parallel processor with its instruction/control memory (256K words), a main parallel memory of 0.5 to 8M words transferring at 75M bytes/second, an instruction processor, and the arithmetic elements.
CLOCK 1   FETCH
CLOCK 2   ALIGN     FETCH
CLOCK 3   PROCESS   ALIGN     FETCH
CLOCK 4   ALIGN     PROCESS   ALIGN
CLOCK 5   STORE     ALIGN     PROCESS
CLOCK 6             STORE     ALIGN
CLOCK 7                       STORE
Figure 6.13 Functional structure and pipelined processing in the BSP. (Courtesy of IEEE Trans. Computers, Lawrie and Vora, 1982.)
the AEs for processing, and routed via the output alignment network into the
modules for storage. These steps are overlapped, as illustrated in Figure 6.13.
Note that the input alignment and the output alignment are physically in one
alignment network. The division shown here presents only a functional partition
of the pipeline stages. In addition to the spatial parallelism exhibited by the 16
AEs and the pipeline operations of the fetch, align, and store stages, the vector
operations in the AEs can overlap with the scalar processing in the scalar processor.
This results in a powerful and flexible system suitable for processing both long
and short vectors and isolated scalars as well.
Both alignment networks contain full crossbar switches as well as hardware
for broadcasting data to several destinations and for resolving conflicts if several
sources seek the same destination. This permits general-purpose interconnectivity
between the arithmetic array and the memory-storage modules. It is the combined
function of the memory-storage scheme and the alignment networks that supports
the conflict-free capabilities of the parallel memory. The output alignment network is also used for inter-arithmetic-element switching to support special functions such as the data compress and expand operations and the fast Fourier transform algorithm.
The file memory is a semiconductor secondary storage. It is loaded with the
BSP task files from the system manager. These tasks are then queued for execution
by the control processor. The file memory is the only peripheral device under the
direct control of the BSP; all other peripheral devices are controlled by the system
manager. Scratch files and output files produced during the execution of a BSP
program are also stored in the file memory before being passed to the system
manager for output to the user. The file memory is designed to have a high data-
transfer rate, which greatly alleviates the I/O-bound problem.
In summary, concurrent computations in the BSP are made possible by four
types of parallelism:
the first approximations for the divide and the square-root iterations. The floating-point format is 48 bits long. It has 36 bits of fraction and 10 bits of binary exponent. This gives 11 decimal digits of precision. The AE has double-length accumulators and double-length registers in key places. This permits the
direct implementation of double-precision operators in the hardware. The AE
also permits software implementations of triple-precision arithmetic operations.
It has been estimated that 20 to 40 megaflops could be achieved for a broad range
of Fortran computations in the BSP.
Each linear address a is assigned to a memory module μ and a word offset i within that module according to

    i = ⌊a/N⌋,        μ = a mod M                           (6.5)

Figure 6.14 shows a 4-by-5 matrix mapped columnwise into the memory of a serial machine. For simplicity of illustration, assume N = 6 and M = 7 in the hypothetical machine. (The BSP has N = 16 and M = 17.) The module and offset calculations are shown in the illustration. The offset will remain constant for a cycle equal to the number of AEs and then increment by one. The module number corresponds to repeated cycles through the modules, with no value repeating in one cycle and the length of the cycle equal to the number of memory banks.
Example 6.2 As long as the number of AEs is less than or equal to the number of memory banks, the addressing sequence will cause a different memory bank to be connected to each AE. Thus, each AE may receive or send a unique data object. The particular storage pattern produced in this six-AE, seven-memory-bank system for the 4-by-5 example array is shown in Figure 6.14. The module and offset calculations for the second row of the array are explained below. The starting address is 1 and the skip distance is 4. We obtain the following module numbers and address offsets:

    μ = 1(mod 7), 5(mod 7), 9(mod 7), 13(mod 7), 17(mod 7)
      = 1, 5, 2, 6, 3                                        (6.6)
    i = ⌊1/6⌋, ⌊5/6⌋, ⌊9/6⌋, ⌊13/6⌋, ⌊17/6⌋ = 0, 0, 1, 2, 2
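The following Python sketch applies the module and offset rules of Eq. 6.5 (as reconstructed above) to the hypothetical six-AE, seven-bank machine and reproduces the second-row access of Example 6.2; the function names are invented for illustration.

N, M = 6, 7                        # number of AEs, number of memory banks

def module(a):  return a % M       # which memory bank holds address a
def offset(a):  return a // N      # word address within that bank

addresses = [1 + 4 * x for x in range(5)]        # 1, 5, 9, 13, 17
print([module(a) for a in addresses])            # [1, 5, 2, 6, 3]
print([offset(a) for a in addresses])            # [0, 0, 1, 2, 2]
# The five module numbers are distinct, so the five elements of the second
# row can be fetched in a single memory cycle.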
Figure 6.14 A 4 × 5 matrix mapped columnwise into a six-AE, seven-memory-bank system, showing the module and offset assigned to each array element.
little redundant memory bandwidth, since only one memory module is unused per cycle. It is clear that conflict-free access to one-dimensional arrays is possible for any arithmetic-sequence index pattern except every 17th element. For two-dimensional arrays with a skewing distance of four, conflict-free access is possible for rows, columns, diagonals, back diagonals, and other common partitions, including arithmetic-sequence indexing of these partitions. The method can be extended to higher numbers of dimensions in a straightforward manner.
The unused memory cells are the result of having one less AE than there are memory banks. However, the example case is not of practical size. For the real BSP with 16 AEs and 17 memory banks, division by 16 is much simpler and faster than division by 17. One then pays the penalty of supplying some extra memory to reach a given usable size. The above equations yield an AE-centered vantage point. As long as the same set of equations is always applied to the data from the first time it comes in as I/O onward, the storage pattern is completely invisible to the user. This applies to program dumps as well, because the hardware always obeys the same rules.
Conflict does occur if the addresses are separated by an integer multiple of the number of memory banks. In this case, all the values one wants are in the same memory bank. For the BSP, this means that skip distances of 17, 34, 51, etc., should be avoided. In practice, 51 is a likely problem skip, because it is the skip of a forward diagonal of a matrix with column length 50. If conflict occurs in the BSP, the arithmetic is performed correctly, but at a fraction of the normal speed. The system logs the occurrence of conflicts and their impact on the total running time. This information is given to the programmer for corrective action if the impact was significant.
DO 5 I=1,30
DO 5 J=7,25                                               (6.8)
5 X(I,J)=(A(I,J+1)*0.5+B(I+1,J))*X(I,J+1)+C(J)
would be compiled as a single vector form. This vector form can be regarded as a
six-address instruction that contains the four array arithmetic-operation specifi-
cations and the assignment operation.
WWW.Gitmgurgaon.blogspot.com
418 COMPUTER ARCHITECTURE AND PARALLEL PROCESSING
DO 3 I=1,25
3 Y(I)=F(I)*Y(I-1)+G(I)                                    (6.9)

has a righthand side that uses a result computed on the previous iteration. This recurrence produces an array of results, while others lead to a scalar result. For example, a polynomial evaluation by Horner's rule leads to the following reduction:

P=C(0)
DO 5 I=1,25
5 P=C(I)+Y*P                                               (6.10)
The third type of vector instructions involves various sparse-array operations. For example, in the case of a Fortran variable with subscripted subscripts, e.g., A(B(I)), no guarantee can be made concerning conflict-free access to the array A. In this case, the indexing hardware generates a sequence of addresses that allows access to one operand per clock. These are then processed in parallel in the arithmetic elements. Such accesses are called random store and random fetch vector forms.
Sparse arrays may be stored in memory in a compressed form and then expanded to their natural array positions using the input alignment network. After processing, the results may be compressed for storage by the output alignment network. These are called compressed vector operand and compressed vector result vector forms, and they use control-bit vectors that are packed in such a way that one 48-bit word is used for accesses to three 16-element vector slices.
The fourth class of vector instructions is used for I/O. Scalar and array assignments are made to control memory and parallel memory, depending on whether they are to be processed in the scalar processor unit or the parallel processor, respectively. However, it is occasionally necessary to transmit data back and forth between these memories. Transmissions to file memory are standard I/O types of operations. In Table 6.2, representative vector instructions in the BSP are listed. These four types of vector instructions comprise the entire set of array functions performed by the BSP.
In ordinary Fortran programs, it is possible to detect many array operations that can easily be mapped into BSP vector instructions. This is accomplished in the BSP compiler by a program called the Fortran vectorizer. We will not attempt a complete description of the vectorizer here, but we will sketch its organization, emphasizing a few key steps.
First, consider the generation of a program graph based on data dependencies. Each assignment statement is represented by a graph node. Directed arcs are drawn between nodes to indicate that one node is to be executed before another. The BSP does a detailed subscript analysis and builds a high-quality graph with few redundant arcs, thereby leading to more array operations and fewer recurrences.
SIMD COMPUTERS AND PERFORMANCE ENHANCEMENT 419
Example 6.3 Consider the following program, which illustrates the problems of scoping and data dependence:

DO 5 I=1,25
1 A(I)=3*B(I)
DO 3 J=1,35                                                (6.11)
3 X(I,J)=A(I)*X(I,J-1)+C(J)
5 B(I)=2*B(I+1)
A dependence graph for this problem is shown in Figure 6.15a, where the nodes are numbered according to the statement label numbers of the program. Node 1 has an arc to node 3 because of the A(I) dependence, and node 3 has a self-loop because X(I, J−1) is used one J iteration after it is generated. The crossed arc from node 1 to node 5 is an antidependence arc, indicating that statement 1 must be executed before statement 5 to ensure that B(I) on the righthand side of statement 1 is an initial value and not one computed by statement 5. Arcs from above denote initial values being supplied to each of the three statements: array B to statements 1 and 5, and array C to statement 3. The square brackets denote the scope of loop control for each of the DO statements.
Given a data-dependence graph, loop control can be distributed down to individual assignment statements or to collections of statements with internal loops of data dependence. In our example, there is one loop (containing just one statement) and two individual assignment statements. After the distribution of loop control, the graph of Figure 6.15a may be redrawn as shown in Figure 6.15b, which can easily be mapped into BSP vector instructions. Statements 1 and 5 go directly into array-expression vector forms, since they are both dyads.

DO 1 I=1,92
DO 1 J=1,46                                                (6.12)
1 IF(A(I,J).LT.0) B(I,J)=A(I,J)*3.5
Figure 6.15 (a) A data-dependence graph for the program of Eq. 6.11; (b) the graph after distribution of loop control.
Table 6.2 Representative vector forms in the BSP

EXTENDED DYAD. Accepts two vector set operands, does one operation, and produces two vector set results: (Z1, Z2) ← A op B
DOUBLE-PRECISION DYAD. Accepts four vector set operands (i.e., two double-precision operands), performs one operation, and produces two vector set results: (Z1, Z2) ← (A1, A2) op (B1, B2)
DUAL-DYAD. Accepts four vector set operands, does two operations, and produces two vector set results: Z ← A op1 B, Y ← C op2 D
TETRAD1. Accepts four vector set operands, does three operations, and produces one vector set result: Z ← ((A op1 B) op2 C) op3 D
TETRAD2. Similar to the TETRAD1 except for the order of operations: Z ← (A op1 B) op2 (C op3 D)
PENTAD1. Accepts five vector set operands, does four operations, and produces one vector set result: Z ← (((A op1 B) op2 C) op3 D) op4 E
PENTAD2. Similar to the PENTAD1 except for the order of operations: Z ← ((A op1 B) op2 (C op3 D)) op4 E
PENTAD3. Similar to the PENTAD1 except for the order of operations: Z ← ((A op1 B) op2 C) op3 (D op4 E)
AMTM. Similar to the MONAD; used to transmit from parallel memory to control memory: Z ← op A
TMAM. Accepts six vector set operands from control memory to transmit to parallel memory: Z ← A1(0, 0), A2(0, 0), A3(0, 0), A4(0, 0), A5(0, 0), A6(0, 0)
COMPRESS. Accepts a vector set operand, compresses it under a bit-vector operand control, and produces a vector set result: X ← A, BV
EXPAND. Accepts a vector operand, expands it under a bit-vector control, and produces a vector set result: X ← V, BV
MERGE. The same as EXPAND, except that the vector set result elements corresponding to a zero bit in BV are not changed in the parallel memory: X ← V, BV
RANDOM FETCH. Performs the operation Z(i, k) ← U(I(i, k)), where U is a vector and I is an index vector set.
RANDOM STORE. Performs the operation X(I(i, k)) ← A(i, k), where X is a vector and I is an index vector set.
REDUCTION. Accepts one vector set operand and produces one vector result given by X(i) ← A(i, 0) op A(i, 1) op A(i, 2) op A(i, 3) … op A(i, L), where op must be a commutative and associative operator.
DOUBLE-PRECISION REDUCTION. Accepts two vector set operands (one double-precision vector set) and produces two vector results (one d.p. vector) given by (X1(i), X2(i)) ← (A1(i, 0), A2(i, 0)) op (A1(i, 1), A2(i, 1)) op … (A1(i, L), A2(i, L)), where op must be a commutative and associative operator.
GENERALIZED DOT PRODUCT. Accepts two vector set operands and produces one vector result given by X(i) ← (A(i, 0) op2 B(i, 0)) op1 (A(i, 1) op2 B(i, 1)) op1 … op1 (A(i, L) op2 B(i, L)), where op1 must be a commutative and associative operator.
RECURRENCE-II. Accepts two vector set operands and produces one vector result given by X(i) ← ((…((B(i, 0) op2 A(i, 1)) op1 B(i, 1)) op2 …) op2 A(i, L)) op1 B(i, L), where op1 can be ADD or IOR and op2 can be MULT or AND.
PARTIAL REDUCTION. Accepts one vector set operand and produces one vector set result given by Z(i, j) ← Z(i, j − 1) op A(i, j), where op must be a commutative and associative operator.
RECURRENCE-IA. Accepts two vector set operands and produces one vector set result given by Z(i, j) ← (Z(i, j − 1) op1 A(i, j)) op2 B(i, j), where op1 can be MULT or AND and op2 can be ADD or IOR.
allow substantial speedups on the BSP. Of course, there is also a residual set of
IFs that must be compiled as serial code. Fortran language extensions have also
been made in the BSP to facilitate vector processing.
A large-scale SIMD array processor has been developed for processing satellite imagery at the NASA Goddard Space Flight Center. The computer has been named the massively parallel processor (MPP) because of the 128 × 128 = 16,384 microprocessors that can be used in parallel. The MPP can perform bit-slice arithmetic computations over variable-length operands. The MPP has a microprogrammable control unit which can be used to define a quite flexible instruction set for vector, scalar, and I/O operations. The MPP system is constructed entirely with solid-state circuits, using microprocessor chips and bipolar RAMs.
Figure 6.16 The system architecture of the MPP system: the array unit (ARU) with 128-bit input and output interfaces and switches, the array control unit (ACU), program and data management, peripherals (magnetic tape, line printer, alphanumeric terminal), and an external host computer. (Courtesy of IEEE Trans. Computers, Batcher, 1980.)
Speed of typical MPP operations (million operations per second)

Addition of arrays
  8-bit integers (9-bit sum)               6553
  12-bit integers (13-bit sum)             4428
  32-bit floating-point numbers             430
Multiplication of arrays (element by element)
  8-bit integers (16-bit product)          1861
  12-bit integers (24-bit product)          910
  32-bit floating-point numbers             216
Multiplication of array by scalar
  8-bit integers (16-bit product)          2340
  12-bit integers (24-bit product)         1260
  32-bit floating-point numbers             373
PEs by the fan-out module. The corner-point module selects the 16 corner elements from the array and routes them to the controller. The I/O registers transfer array data to and from the 128-bit I/O interfaces, the database machine, and the host computer. Special hardware features of the array unit are summarized below:
Figure 6.17 The PE array and supporting devices of the array unit. (Courtesy of Goodyear Aerospace Corp.)
Each PE in the array communicates with its nearest neighbor up, down, right, and left, the same routing topology used in the Illiac-IV. The ability to access data in different directions can be used to reorient the arrays between the bit-plane format of the array and the pixel format of the image. The edges of the array can be left open, to have a row of zeros enter from the left edge and move to the right, or the opposite edges can wrap around. Since cases have been found where open edges were preferred and other cases have been found where connected edges were preferred, it was decided to make edge connectivity a programmable function.
A topology register in the array control unit defines the connections between opposite edges of the PE array. The top and bottom edges can either be connected or left open. The connectivity between the left and right edges has four states: open (no connection), cylindrical (connect the left PE of each row to the right PE of the same row), open spiral (for 1 ≤ n ≤ 127, the left PE of row n is connected to the right PE of row n − 1), and closed spiral (like the open spiral, but also connecting the left PE of row 0 to the right PE of row 127). The spiral modes connect the 16,384 PEs together in a single linearly connected list.
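The Python sketch below models one plausible reading of these four connectivity states by computing the eastern neighbor of a PE; it is an assumption-laden illustration, not MPP hardware or microcode, and the function name and mode strings are invented.

ROWS = COLS = 128

def east_neighbor(row, col, mode):
    # mode: 'open', 'cylindrical', 'open_spiral', or 'closed_spiral'
    if col < COLS - 1:
        return (row, col + 1)                      # interior: plain east link
    if mode == 'open':
        return None                                # right edge not connected
    if mode == 'cylindrical':
        return (row, 0)                            # wrap to the same row
    # spiral modes: right end of row n feeds the left end of row n + 1
    if row < ROWS - 1:
        return (row + 1, 0)
    return (0, 0) if mode == 'closed_spiral' else None

print(east_neighbor(5, 127, 'cylindrical'))        # (5, 0)
print(east_neighbor(5, 127, 'open_spiral'))        # (6, 0)
print(east_neighbor(127, 127, 'closed_spiral'))    # (0, 0)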
The PEs in the array are implemented with VLSI chips. Eight PEs are arranged in a 2 × 4 subarray on a single chip. The PE array is divided into 33 groups, with each group containing 128 rows and 4 columns of PEs. Each group has an independent group-disable control line from the array controller. When a group is disabled, all its outputs are disabled and the groups on either side of it are joined together with 128 bypass gates in the routing network.
Figure 6.18 Functional structure of a processing element (PE) in the MPP: a full adder, an N-bit shift register (N = 2, 6, 10, 14, 18, 22, 26, or 30), routing links to the PEs on the east and west, a connection to the SUM-OR tree, and a random-access memory. (Courtesy of IEEE Trans. Computers, Batcher, 1980.)
The random-access memory stores up to 1024 bits per PE. Standard RAM chips are available to expand the memory planes. Parity checking is used to detect memory faults. A parity bit is added to the eight data bits of each 2 × 4 subarray of PEs. Parity bits are generated and stored for each memory-write cycle and checked when the memories are read. A parity error sets an error flip-flop associated with each 2 × 4 subarray. A tree of logic elements gives the array controller an inclusive-OR of all error flip-flops. By operating the group-disable control lines, the controller can locate the group containing the error and disable it.
Standard 4 × 1024 RAM chips are used for the PE memories. As shown in Figure 6.19, 2 × 4 subarrays of PEs are packaged on a custom VLSI CMOS-SOS chip. The VLSI chip also contains the parity tree and the bypass gates for the subarray. Each printed-circuit board contains 192 PEs in an 8 × 24 array. Sixteen boards make up an array slice of 128 × 24 PEs. Five array slices (80 boards) make up the bulk of the entire PE array. The remaining 12 PE columns are packaged on 16 I/O-processor boards, which also contain the topology switches, the I/O switches, and the I/O interface registers. The 96 boards of the array are packaged in one cabinet with forced-air cooling.
Like the control unit of other array processors, the array controller of the MPP performs scalar arithmetic and controls the operation of the PEs. It has three sections that can operate in parallel, as depicted in Figure 6.20.
Figure 6.19 The interconnection of VLSI PE and RAM chips in the MPP array. (Courtesy of Goodyear Aerospace Corp., 1980.)
Figure 6.20 The three parallel sections of the MPP array control unit: the PE control (with its own control memory and a call queue), the main control (with the main control memory), and the I/O control, all supervised by the PDMU and driving the ARU.
The PE control performs all array arithmetic in the application program. The I/O control manages the flow of data into and out of the array. The main control performs all scalar arithmetic of the application program. This arrangement allows array arithmetic, scalar arithmetic, and input-output to take place concurrently.
The PE control generates all array control signals except those associated with the I/O. It contains a 64-bit common register to hold scalars and eight 16-bit index registers to hold the addresses of bit planes in the PE memory elements, to count loop executions, and to hold the index of a bit in the common register. The PE control reads 64-bit-wide microinstructions from the PE control memory. Most instructions are read and executed in 100 ns. One instruction can perform several PE operations, manipulate any number of index registers, and branch conditionally. This reduces the overhead significantly, so that PE processing power is not wasted.
The PE control memory contains a number of system routines and user-written routines to operate on arrays of data in the array. The routines include both array-to-array and scalar-to-array arithmetic operations. A queue between the PE control and the main control holds up to seven calls to the PE control routines. Each call contains up to eight initial index-register values and up to 64 bits of scalar information. Some routines extract scalar information from the array (such as a maximum value) and return it to the main control.
The I/O control shifts the S registers in the array, manages the flow of information into and out of the array ports, and interrupts the PE control momentarily to transfer data between the S registers and buffer areas in the PE memory elements. Once initiated by the main control, the I/O control can chain through a number of I/O commands. The main control is a fast scalar processor which reads and executes the application program in the main control memory. It performs all scalar arithmetic itself and places all array arithmetic operations on the PE control call queue.
The MPP being delivered to NASA uses a DEC VAX-11/780 computer as the host. The interface to the host has two links: a high-speed data link and a control link. The high-speed data link connects the I/O interface registers of the MPP to a DR-780 high-speed user interface of the VAX-11/780. Data can be transferred at the rate of 6 megabytes/s. The control link is the standard DECNET link between a PDP-11 and a VAX-11/780. The DECNET hardware and software allow the VAX users to transfer their program requests to the MPP from remote stations.
Figure 6.21 The staging memory concept in the MPP.
capable of transferring 320M bytes/s. The staging memory acts as a data buffer between the parallel array and the outside world. The staging memory can accept data from conventional and special-purpose peripherals at rates up to 320M bytes/s. Its internal controller allows it to pack and reformat data so that the parallel array can process them more efficiently.
The staging memory can randomly address any individual datum but, for any given address, it fetches a block of 16K data elements and sends it to the array, where each microprocessor memory receives one datum. The exact configuration of the block of data fetched by the staging memory is under program control. For example, if k is the specified address, data are fetched from the following addresses:

    k,       k + n,       k + 2n,       …,   k + 127n,
    k + m,   k + n + m,   k + 2n + m,   …,   k + 127n + m,
    …
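The Python sketch below generates such a block of addresses from a start address k and the increments n and m, and checks the pixel example of Figure 6.22 (k = y · 512 + x, n = 1, m = 512). The function name and test values are invented for illustration.

def block_addresses(k, n, m, rows=128, cols=128):
    # one address per PE: column stride n within a row, row stride m
    return [[k + r * m + c * n for c in range(cols)] for r in range(rows)]

x, y = 40, 7
addrs = block_addresses(k=y * 512 + x, n=1, m=512)
assert addrs[0][0] == y * 512 + x            # top-left pixel (x, y)
assert addrs[0][127] == y * 512 + x + 127    # 128 consecutive pixels per row
assert addrs[127][0] == (y + 127) * 512 + x  # 128 consecutive image rows
assert len(addrs) * len(addrs[0]) == 16384   # a block of 16K data elements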
Instruction set of the MPP The instruction set for the MPP can be divided into
three subsets: sequential, parallel, and interface. The sequential instructions are
similar to those of any other sequential computer. They consist of load, store, add,
subtract, compare, branch, logical operations, etc. Executed by the sequential
controller alone, these instructions are used primarily to direct program flow and
to calculate individual parameters and constants that will be broadcast to the
parallel array.
The parallel instructions, also similar to conventional sequential instruction
sets, consist of load, store, add, subtract, compare, and logical operations, but not
branch. The parallel instructions are stored in the sequential controller’s memory
Figure 6.22 Parallel access to the staging memory: a 128 × 128 block of 16K pixels within a 512-column by 512-row image, addressed with k = y · 512 + x, n = 1, and m = 512. (Courtesy of IEEE Computer, Potter, January 1983.)
intermingled with the other instructions. When the sequential controller detects a parallel instruction, it passes it via the interface registers to the parallel array, where it is executed by all 16K processors simultaneously. This fundamental form of parallelism provides the enormous computing power of the MPP.
The MPP, like most SIMD processors, has a set of interface instructions that allows the movement of data between the sequential and parallel portions. Constants and parameters can be broadcast to each of the parallel processors by the sequential controller with a special version of the parallel instructions. But the inverse operation, moving parallel results to the sequential portion, can be more complex. The key is the ability to select a unique processor of the 16K to be active.
Each of the PEs is assigned a unique identification number. The STEP instruction selects the lowest-numbered PE to be active and thus enables it to communicate with the sequential controller. The STEP instruction can be combined with a previously executed comparison or other logical operation so that only those PEs satisfying the logical conditions are involved in the operation.
The STEP instruction can also be combined with other instructions to sequence through specified subsets of PEs. This allows the data from each PE to be processed in turn by the sequential controller and by the PE array under program control. Described below are several planned image-processing applications of the MPP. Performance results on the MPP were not available at the time this book was produced.
Syntactic pattern analysis In addition to feature extraction, the parallel array can be used very effectively to guide linguistic techniques. In general, these techniques consist of a large number of production or reduction rules that must be selectively applied. If one rule is stored in each PE, then 16K rules can be updated and searched in parallel without being ordered.
Figure 6.23 Image data storage in the bit-plane addressable MPP memory system. (Courtesy of IEEE Computer, Potter, January 1983.)
In the MPP and other SIMD processors, a different rule can be assigned to
each PE. Consequently, when a feature is found, all of the rules can be considered
in parallel to determine those that apply; the STEP instruction can then be used
to sequence through the rules that require processing. Since the addition and
deletion of rules in the array memory is a simple operation requiring no sorting,
data packing, or garbage collection, this approach will be extremely useful in
situations where the grammar is undergoing modification.
This ratio means that there is no guarantee we will find one or a limited number of
allocation schemes that will allow us to access an arbitrary M vector in a single
memory cycle.
Define d as the distance (mod M), measured in columns, between adjacent elements of a subarray in the P × Q space. For example, the elements of a row or of a diagonal have d = 1, the elements of a column have d = 0, and the even-subscripted elements of a row or diagonal have d = 2. If Q ≤ M, stored arrays can be mapped into memory modules by placing the adjacent elements of rows in the same relative locations of adjacent memory modules (i.e., rows across the memory). If Q > M, the above mapping can be performed on partitions of Q.
Furthermore, we define s, the skewing degree of a storage scheme, as the distance measured in columns that each row has been shifted with respect to the row above it. All shifting must be done mod M. One value of s is used for an entire array. Figure 6.26 illustrates the storage of an array with skewing degrees s = 0 and s = 1.
The elements of an M vector fetched in one memory cycle will have the same order they had in matrix space if, and only if, 1 ≡ d + (j − i)s (mod M) for all i < j, where i and j are the successive rows in P × Q space from which elements are fetched. Let j − i = r. Note that d + r × s (mod M) is the displacement between successive elements to be fetched. If this is not one, the elements are not in successive memory modules. For each stored M vector, we define an ordered set

    S = {k_i | 1 ≤ k_i ≤ M, 1 ≤ i ≤ M}                      (6.14)

The ith element of the M vector is stored in memory module k_i. Let s′ = d + r × s (mod M) be the displacement between successive M-vector elements. If M is relatively prime to s′, then we can access the M vector in one memory cycle. It follows that if M and s′ have a common divisor g > 1, then only M/g elements may be accessed at once and g fetches are required to complete the access of that vector.
If we restrict ourselves to M = 2^L for some integer L, we can access rows and diagonals with s = 0, or rows and columns with s = 1. However, if s = 1 and d = 1 for diagonals, then d + s = 2 and we cannot access diagonals (Figure 6.25) in parallel. In fact, for any s, either s or s + 1 must be even. This proves the impossibility of fetching rows, columns, and diagonals using one storage scheme when M is an even integer.
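The test developed above reduces to a greatest-common-divisor check, which the short Python sketch below illustrates; the function name and example values are invented, and r is taken as the row step between successive fetches.

from math import gcd

def fetches_needed(M, d, s, r=1):
    # s' = d + r*s (mod M) is the module displacement between successive
    # elements; access is conflict-free exactly when gcd(s', M) = 1,
    # otherwise g = gcd(s', M) fetches are needed (M/g elements at a time).
    s_prime = (d + r * s) % M
    return gcd(s_prime, M)

M = 8
print(fetches_needed(M, d=1, s=0))  # rows, no skew: 1 fetch (conflict-free)
print(fetches_needed(M, d=0, s=1))  # columns, skew 1: 1 fetch (conflict-free)
print(fetches_needed(M, d=1, s=1))  # diagonals, skew 1: s' = 2, so 2 fetches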
With the following nonuniform skewing scheme, it is possible to access rows, columns, and some square blocks. Suppose we skew row i by √M + δ_i (mod M), where δ_i = 1 if the row index i = k√M + 1 for some k ≥ 1, and δ_i = 0 otherwise. Within a strip of width √M, we can clearly access a √M × √M block in one cycle. Across these strip boundaries, conflicts arise in fetching square blocks. However, due to the additional skewing by one at strip boundaries, it is possible to access columns, because the total skew is relatively prime to M, as shown in Figure 6.26. Sometimes an access to rows or columns of a square block may be desired. Then blocks may be regarded as one memory access and the above method can be applied. This
[Figure 6.26: storage of a 4 × 4 array in four memory modules, with addresses listed per module: (a) skewing degree s = 0; (b) skewing degree s = 1]
[Table 6.4: an 8 × 8 array A stored in M = 5 memory modules using column-major storage; addresses 0 through 15 in each module, with one dummy (unused) entry per address row]
also interesting, being an odd power of two minus one and not much larger than a
perfect square.
A storage scheme is a set of rules which determines the module number and
address within that module where a given array element is stored. We will restrict
our attention to two-dimensional arrays. However, generalization of these storage
schemes is simple for higher-dimensioned arrays. Described below is the storage
scheme successfully developed by Lawrie and Vora (1982) for the BSP.
Example 6.4 Table 6.4 shows an 8 × 8 array stored in five memory modules using column major storage. Any five consecutive elements of a row, column, diagonal, etc., lie in separate modules and thus can be accessed in parallel without conflict. For example, the second through sixth elements of the first row are stored in module numbers 3, 1, 4, 2, 0, and at addresses 2, 4, 6, 8, 10, respectively.
α(x) = a(ax + b, cx + e) = ⌊(rx + B)/P⌋        (6.19)
it is easy to show that if r is relatively prime to the number of memory modules, then access to the P vector can be made without memory conflict.
Since it is most convenient to generate the address α(x) in memory μ(x), we solve for x in terms of μ and get
x(μ) = [(μ − B)r′] mod M        (6.20)

where r × r′ ≡ 1 mod M. Substituting this into Eq. 6.19, we get

α(μ) = ⌊(r{[(μ − B)r′] mod M} + B)/P⌋
Example 6.5 Consider the P vector V(0, 0, 1, 1), i.e., the second through sixth elements of the first row of A(8 × 8). We have r = 8 and B = 8; thus

μ(x) = [(x) × 8 + 8] mod 5
α(x) = ⌊[(x) × 8 + 8]/4⌋

Since r′ = 2 (i.e., 2 × 8 ≡ 1 mod 5), B = 8, and M = 5, the following addresses are obtained as shown in Table 6.4:

μ(x) = (3, 1, 4, 2, 0)
α(x) = (2, 4, 6, 8, 10)
α(μ) = (10, 4, 8, 2, 6)

The proper addresses in memories M_0, M_1, ..., M_4 are 10, 4, 8, 2, 6, respectively.
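The address calculations of Examples 6.4 and 6.5 can be reproduced in a few lines. The sketch below is only an illustration (not the BSP hardware); it assumes M = 5 modules, P = 4 processors, and a P vector whose xth element has linear address r·x + B, as in Example 6.5.

M, P = 5, 4        # number of memory modules (prime) and processors
r, B = 8, 8        # element spacing and base linear address (Example 6.5)

def mu(x):         # module number holding the xth element of the P vector
    return (r * x + B) % M

def alpha(x):      # word address within that module
    return (r * x + B) // P

print([mu(x) for x in range(5)])       # -> [3, 1, 4, 2, 0]
print([alpha(x) for x in range(5)])    # -> [2, 4, 6, 8, 10]

# Address generation on the memory side (Eq. 6.20): recover x from the
# module number using r' with r * r' = 1 (mod M); here r' = 2.
r_prime = 2
print([alpha(((m - B) * r_prime) % M) for m in range(M)])   # -> [10, 4, 8, 2, 6]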
We use the μ(x) equation in the xth processor to determine the module number of the memory containing the xth element of the desired N vector. At the same time, addressing hardware in memory μ uses the α(μ) equation to determine the necessary address of the desired element. We use α(μ) instead of α(x) because this eliminates the need to route the addresses from the processors through the switch.
The design of the access conflict-free memory is based on the use of a prime number of memories. Crucial to this design is the simplification of the offset equations. Most of the mod M operations and offset calculations can be done with ROMs or with some indexing hardware. The design of this memory system fits nicely in the context of the BSP. The indexing hardware carries out the necessary addressing and alignment calculations automatically once the initial vector-set descriptors have been set up. The problem of indexing overhead and memory-access conflicts may seriously deteriorate the system performance if not properly controlled.
each of the PEs. When a bit of the mask pattern is true, the corresponding PE is enabled and may thus deliver the results of an operation. Consider a Glypnir conditional expression that assigns to C the larger of A and B. Such an expression will deliver the maximum elements of A and B to C and may result in both the then and else statements being executed.
Blocks of assembler language can be explicitly embedded in a Glypnir program for the optimization of any section of code. It also has facilities to refer to selected hardware registers for lower-level code optimization. However, the language demands that the programmer undertake the detailed supervision of storage allocation and be constrained to only Illiac-IV rows (64 components) or vectors of rows. To remove these restrictions, the Illiac-IV Fortran allows the user to program with vectors of any length in either "straight" or "skewed" storage allocations. Skewed allocation allows equal accessibility of rows or columns in an array.
In Illiac-IV Fortran, the binary data type can be used to specify bit-control vectors for masking purposes. The DO statement has been extended to allow parallel execution of arithmetic expressions, and extra constructs have been added to the language to allow the shifting and rotation of vectors and array rows. The only significant changes are to the EQUIVALENCE and COMMON statements, where the two-dimensional store of the Illiac-IV imposes restrictions on the usual serial definition.
A parallel-processing programming language, Actus, has been introduced by R. H. Perrott (1979) for array processors. Most parallel computers use extensions of existing languages, such as the extended Fortran for the Star-100, the CFT for the Cray-1, and the Glypnir language for the Illiac-IV. The language SL-1 is one of the few languages that has tried to bring some of the benefits of structured programming to the Star-100 system. More recently, Vectran has been developed by the IBM research group to facilitate the application of vector-array processing algorithms. Actus offers a theoretical extension of the language Pascal. Actus attempts to redress the technology imbalance between hardware and software development
1. The shift operator, which causes movement of the data within the range of the declared extent of parallelism
2. The rotate operator, which causes the data to be shifted circularly with respect to the extent of parallelism
all the elements of a are incremented by one until none of the elements of a are less than their corresponding element in b.
Functions and procedures can be declared using the data declarations and statements previously defined; the maximum extent of parallelism of all variables must be known at compile time. The Pascal scope rules for procedures and functions apply; hence, local variables cannot have their extent of parallelism altered by a function or procedure call.
The formal parameter list for both functions and procedures was expanded to
allow for parameters which are parallel variables. The actual parameters can then
be either of the same extent of parallelism or a section of the same extent of a larger
parallel variable. Only procedures and functions involving scalar variables may
be parameters. In the case of a function, either a scalar or parallel variable can be
returned as a result of its execution; the extent of parallelism can be different from
that of the parameter(s). Procedures can be used to return one or more results
which can be either scalar or parallel variables or a mixture of both.
The features of Actus have been described using a syntax similar to that of
Pascal; this was due to a plan to use an existing Pascal compiler for its implementa-
tion at the Institute of Advanced Computation, NASA/Ames Research Center in
California. The Pascal P compiler was used in the creation of the Actus compiler.
This P compiler is being modified and enhanced with the new features to form an
Actus P compiler which also generates code for a hypothetical stack computer.
Since this code is machine independent, the Actus P compiler can be used as a
basis for the implementation of Actus on other parallel machines. Preliminary
results of the implementation indicate that the features of Actus can be mapped
onto the instruction set of the Illiac-IV.
Another consideration in implementing the Actus language is to automate the
management of the memory. It is important to determine either from the user or
by the compiler the size of the working set; the working set is the minimum amount
of the database required to be resident in the fast store so that processing can
continue without excessive interruptions. On the basis of such information, the
fast store can be divided into buffers and processing can be overlapped with
backing store transfers. Thus, the compiler rather than the user is responsible for
the organization of data transfers.
In summary, two research objectives were achieved by the Actus language and
by its compiler. The first objective was achieved by introducing the concept of
the extent of parallelism, whose maximum size is defined in the data declarations
and subsequently manipulated (in parallel) either in part or in total by the state-
ments and constructs of the language. Using this concept, it was found to be
possible to adopt a unified approach for both types of computers. The second
objective was achieved by modifying existing data and program-structuring con-
structs of Pascal to accommodate the special demands of a parallel environment.
Note that this list of performance parameters is very similar to those for pipeline computers studied in Section 4.5.4. The m PEs correspond to the k pipeline segments. In fact, the vector job parameters n and N_i are the same in analyzing both types of vector processors. The instruction execution time t_e of a PE is assumed to be a constant equal to the average time for typical instructions. Figure 6.27 shows the space-time diagram for the execution of the ith vector instruction on an array processor with m = 3 PEs, where N_i = 10 operands are contained in the ith vector instruction. The PE arrays are used four times to complete the execution of the 10 operations. During the fourth iteration, two PEs are disabled.
In an array processor, the time required to finish the ith instruction is computed by

t_i = ⌈N_i/m⌉ × t_e        (6.31)
Figure 6.27 The space-time diagram for an array processor with 3 PEs.
The total time required to execute a job of n vector instructions is

T = Σ_{i=1}^{n} t_i = Σ_{i=1}^{n} ⌈N_i/m⌉ × t_e        (6.32)

The same job, if executed on a serial SISD computer, requires

T_1 = Σ_{i=1}^{n} N_i × t_e        (6.33)

The speedup of the m-PE array processor over the SISD computer is

S_m = T_1/T = Σ_{i=1}^{n} N_i / Σ_{i=1}^{n} ⌈N_i/m⌉        (6.34)

and the efficiency is

E = S_m/m = Σ_{i=1}^{n} N_i / (m Σ_{i=1}^{n} ⌈N_i/m⌉)        (6.35)
This shows that the efficiency E corresponds to the ratio of the actual speedup S_m to the maximum speedup m. In the ideal case E = 1, when S_m = m.
Example 6.6 To illustrate the above performance measures, we choose the same vector distribution as in Figure 4.39. The mean vector length of the job distribution is 4.4. On the average, 4.4 PEs are needed to execute a vector instruction among a set of 10 instructions. Similar to that plotted in Figure 4.39, we calculate the efficiency (using Eq. 6.35) and the speedup (Eq. 6.34) of an array processor having m PEs (1 ≤ m ≤ 10). The numerical results are plotted in Figure 6.28. The speedup increases monotonically with respect to the number of available PEs, whereas the efficiency declines with the increase of PEs.
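The measures in Eqs. 6.31 through 6.35 are easy to tabulate. The short sketch below uses an assumed set of ten vector lengths with mean 4.4 (the exact distribution of Figure 4.39 is not reproduced here) and prints the speedup and efficiency for several values of m.

from math import ceil

def speedup_efficiency(lengths, m):
    """Speedup S_m and efficiency E of an m-PE array processor over an
    SISD machine (Eqs. 6.32 to 6.35); the PE cycle time t_e cancels out."""
    serial = sum(lengths)                          # sum of N_i
    parallel = sum(ceil(N / m) for N in lengths)   # sum of ceil(N_i / m)
    s_m = serial / parallel
    return s_m, s_m / m

lengths = [1, 2, 3, 3, 4, 4, 5, 6, 7, 9]           # assumed lengths, mean 4.4
for m in (1, 2, 4, 8, 10):
    s_m, eff = speedup_efficiency(lengths, m)
    print(m, round(s_m, 2), round(eff, 2))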
For an array processor, not only the vector length but also the residue of the vector length will affect the system performance. For example, to execute a vector instruction of length 65 on the Illiac-IV computer with 64 PEs requires one additional instruction cycle to execute the residue of one component operation. The system performance will be degraded due to small residues. Degradation results mainly from many idle PEs. Figure 6.29 shows the speedup (utilization) against vector length on an array processor with eight PEs. Here we consider only the execution of a single vector instruction (n = 1). The maximum speedup is achieved when N_1 is a multiple of m = 8.
For small residues, the speedup drops rapidly. When N_1 approaches infinity, the ill effect of residue becomes less severe. The PE utilization will approach one when the vector length goes to infinity, as shown by the envelope of the saw-tooth curve. Parallel algorithms, good memory-allocation schemes, and optimizing compilers are needed in array processors. Whether a large array processor will perform as ideally projected depends heavily on the skill of the users. The failure of promoting the Illiac-IV and the BSP into the commercial market was mainly due to user reluctance in accepting SIMD computers for general-purpose applications.
Figure 6.28 Efficiency and speedup of an array processor with various numbers of PEs.
Figure 6.29 The speedup (utilization) of an 8-PE array versus vector length.
[Figure: multiple-SIMD computer organizations; control units CU_1, ..., CU_m are connected through an m-by-r interconnection network to a pool of processing elements PE_1, ..., PE_r, each with memory (M: memory) and I/O]
[Figure: four quadrant arrays of 64 PEs each, with control units (CU), an I/O switch, a real-time link, a parallel-access disk, and a B-6500 general-purpose control computer connected to peripherals; the quadrants can be configured as a single 256-PE array, two 128-PE arrays, or four 64-PE arrays]
Figure 6.31 The original Illiac IV design and multiple configurations. (Courtesy of IEEE Trans. Computers, Barnes, et al., August 1968.)
Figure 6.32 The PM4 multi-SIMD/MIMD computer organization (MP: monitor processor; VCU_1, ..., VCU_m: vector control units; SMMU: shared-memory management unit; PMU: processor memory unit; IPCN: interprocessor communication network; PMIN: processor-memory interconnection network; MCU: memory control unit). (Courtesy of AFIPS NCC Proc., June 1979.)
Figure 6.33 The MSIMD model with m control units (CUs) and a shared-resource pool of r processing elements (PEs). (Courtesy of IEEE Trans. Computers, Hwang and Ni, September 1980.)
dynamically partitionable into subarrays of various sizes. The PM4 was proposed to operate in this fashion.
Several research issues of MSIMD computers are identified below. The goal
is to design MSIMD computers for multiple array processing in an interactive
manner. These design issues must be properly addressed in order to achieve high
performance at reasonable system cost. The operating system of an MSIMD
computer is much more complicated than that of a single SIMD machine.
The original SIMD concept can be traced to Unger (1958) and Slotnick, et al. (1962). The structure of the original 256-PE Illiac-IV computer was first reported in Barnes, et al. (1968). The Illiac-IV software and applications programming was reported by Kuck (1968). A more recent report of the Illiac-IV system was given in Bouknight, et al. (1972). The Actus language extensions for array processors are based on the work of Perrott (1979). Assessments of the first-generation vector processors, including the Illiac-IV, were given by Stokes (1977) and Theis (1974). The language Glypnir was described in Lawrie, et al. (1975). The Fortran-like language CFD was reported by Stevens (1975). Programming experiences on the Illiac-IV are summarized in Stevenson (1980). Possible extension of the Illiac-IV to the Phoenix array processor was discussed in Feierbach and Stevenson (1978).
The Burroughs Corporation has published a series of technical notes on the BSP architecture (1978a, b, c). The BSP was comprehensively reported in Kuck and Stokes (1982). Arithmetic design of the BSP was reported in Gajski and Rubinfield (1978). The prime memory system is based on the work of Lawrie and Vora (1982). The MPP has been reported in Batcher (1980). Detailed design features of the MPP are described in a final report by Goodyear Aerospace Corporation (1979). The image processing applications of the MPP are reported by Potter (1983). The skewed allocation of parallel memories is based on the work of Budnik and Kuck (1971). A good summary of BSP features is given in Kozdrowicki and Theis (1980), where comparisons of the BSP to the Cyber-205 and Cray-1 are provided. The throughput analysis of array processors is based on the comparative study by Hwang, et al. (1981). Multiple SIMD computer organizations are modeled in Hwang and Ni (1979, 1980). The partitioning of permutation networks for MSIMD machines has been studied in Siegel (1980).
Problems
6.1 Explain the following system features associated with the Illiac-IV, the BSP, and the MPP array processors.
(a) Multi-array configurations of the Illiac-IV
(b) The prime memory for the BSP
(c) The bit-slice operations in the MPP
(d) Concurrent scalar-array operations in the BSP
(e) Concurrent I/O and arithmetic logic operations in the MPP array
(f) The staging memory configurations in the MPP
(g) Host computers for the Illiac-IV, the BSP, and the MPP
(h) The I/O facilities in the Illiac-IV, the BSP, and the MPP
6.2 Prove that the Illiac recirculating network cannot be partitioned into independent subnetworks, each of which would have the properties of a complete Illiac network. (Hint: Use the routing functions defined in Section 5.2.2.)
6.3 Devise an SIMD algorithm for finding the inverse of an 8 × 8 triangular matrix A = (a_ij) on the Illiac-IV computer with 64 PEs. Show the memory allocation and CU instructions needed to implement the algorithm. Minimizing the number of instruction steps and the required data memory words is the primary design goal. The given matrix A is assumed to be nonsingular. The successive contents of the involved PE registers and of the PEMs should be demonstrated along with the instruction steps. Masking can be used to enable and disable the PEs.
6.4 Using the vector instruction forms specified in Table 6.2, devise a BSP program for solving a linear triangular system of algebraic equations with n unknowns. You can assume n ≫ p = 16, the number of arithmetic elements in the BSP. A "block" back substitution method is suggested to solve the triangular system. Memory allocation for the characteristic matrix must be specified among the m = 17 memory modules.
6.5 Given an n × m image, the gray level for each pixel (picture element) is between 0 and b − 1. Let A[i, j] denote the gray level at the pixel (i, j). An algorithm to construct a histogram in an SISD computer is shown below:
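A minimal sketch of such a serial histogram loop is given below (illustrative names and indexing only, not the original listing).

def sisd_histogram(A, n, m, b):
    """Serial histogram of an n-by-m image A with gray levels 0 .. b-1."""
    H = [0] * b
    for i in range(n):
        for j in range(m):
            H[A[i][j]] += 1        # one increment per pixel
    return H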
Suppose we want to use an SIMD machine with p PEs to construct the histogram. Assume n·m ≫ p; n, m, and p (p ≥ 2) are powers of 2, and n/p = k is an integer. Each PEM stores k rows of image data, e.g., PEM_1 stores rows one to k, etc. The storage formats are shown in Figure 6.34, where
the memory locations from α to α + b − 1 in each PEM are used to store a local histogram. The method to construct the histogram consists of forming a local histogram in each PE and then combining local histograms to get a global histogram.
The organization of each PE_i is shown in Figure 5.2. Each PE_i can communicate with its neighboring PE_{i+1} and PE_{i−1} in one routing step. You are allowed to use the vector instructions and scalar instructions listed in Table 6.5. The number below each instruction indicates the instruction cycle time. Also, we assume that there are five global index registers in the control unit.
(a) Using these instructions, write a program to construct a histogram in the SIMD machine. The resulting data should be stored in the memory locations α, α + 1, ..., α + b − 1 in PEM_1.
(b) Compute the total number of cycles required in your program. What is the speedup of your program over a conventional one in an SISD computer, which consists of the CU and one PE? Assume the scalar counterpart of each vector instruction consumes the same time as each vector instruction. Note that there is no communication cost in an SISD computer.
6.6 Consider the evaluation of the following inner-product expression in an SISD machine with one PE or in an SIMD machine with m PEs connected by a linear circular ring:

S = Σ_{i=1}^{N} A_i × B_i        (6.36)

It is assumed that each ADD operation requires two time units and each MULTIPLY operation requires four time units. Data shifting along the bidirectional ring between any adjacent PEs requires one time unit.
(a) What is the evaluation time of S on the SISD computer?
(b) What is the evaluation time of S on the SIMD computer?
(c) What is the speedup of using the SIMD machine over the SISD machine for the evaluation of S?
6.7 Consider K couples of vectors. The ith couple consists of a row vector R_i and a column vector C_i, each of dimension N = 2^n. To compute the pairwise inner product for the ith couple, we perform the following:

IP[i] = Σ_{j=1}^{N} R_i[j] × C_i[j]        (6.37)

For i ← 1 to K do
begin
    IP[i] ← 0;
    For j ← 1 to N do
        IP[i] ← IP[i] + R_i[j] * C_i[j];
end
(a) Neglecting the initialization step, index updating and testing, find the total compute time on a uniprocessor as a function of K and N. Assume that multiplication and addition take the same unit time to complete.
(b) To speed up this computation, an SIMD machine can be used by exploiting the parallelism in the computation. Two different implementations are suggested below. Find the compute time in each case.
(i) Use P = N processing elements (PEs) to compute IP[i] successively for each couple of vectors R_i, C_i.
(ii) A couple of vectors are allocated to each PE, which computes one inner product. The number of PEs is P = K in this case.
6.8 Consider an SIMD machine with 256 PEs using a perfect shuffle interconnection network. If the shuffle interconnection function is executed 10 times, where will the data item originally in PE_131 be located?
6.9 We have learned that an SIMD machine with P = 2^{2n} PEs can access without conflict the rows, columns, diagonal, and reverse diagonal of a matrix from M = 2^{2n} + 1 parallel memory modules, if the skewing distance is S = 2^n. Prove that it is also possible to access any 2^n-by-2^n square block in one memory cycle under the same condition.
6.10 Table 6.4 shows the skewed memory allocation for an 8 × 8 matrix in an array processor with M = 5 memory modules and P = 4 processors.
(a) List all patterns that can be accessed in one memory cycle.
(b) Given a P vector (1, 1, 1, 1), calculate the word addresses in the memory modules involved.
CHAPTER
SEVEN
MULTIPROCESSOR ARCHITECTURE AND
PROGRAMMING
with each other. An example of a multiple computer system is the IBM Attached
Support Processor System. A multiprocessor system is controlled by one operating
system which provides interaction between processors and their programs at the
process, data set, and data element levels. An example is the Denelcor’s HEP
system.
Two different sets of architectural models for a multiprocessor are described
below. One is a tightly coupled multiprocessor and the other is a loosely coupled
multiprocessor. Tightly coupled multiprocessors communicate through a shared
main memory. Hence the rate at which data can be communicated from one processor
to the other is on the order of the bandwidth of the memory. A small local memory
or high-speed buffer (cache) may exist in each processor. A complete connectivity
exists between the processors and memory. This connectivity can be accomplished
either by inserting an interconnection network between the processors and the
memory or by a multiported memory. One of the limiting factors to the expansion,
of a tightly coupled system is the performance degradation due to memory con-
tentions which occur when two or more processors attempt to access the same
memory unit concurrently. In Chapter 2, we have seen some configurations of
interleaved main memory suitable for multiprocessors. The degree of conflicts
can be reduced by increasing the degree of interleaving. However, this must be
coupled with careful data assignments to the memory modules. Another limiting
factor is the processor-memory interconnection network itself. This will be dis-
cussed in more detail later.
[Figure: a computer module of a loosely coupled multiprocessor: a processor (P), local memory (LM), and I/O on a local bus, attached through a channel and arbiter switch to the message transfer system (MTS)]
collide in accessing a physical segment of the MTS, the arbiter is responsible for
choosing one of the simultaneous requests according to a given service discipline. It
is also responsible for delaying other requests until the servicing of the selected re-
quest is completed. The channel within the CAS may have a high-speed communica-
tion memory which is used for buffering block transfers of messages. The com-
munication memory is accessible by all processors. With the advent of VLSI
technology, the computer module can be fabricated on a single integrated circuit
and be used as the building block of a multiprocessor system.
The message-transfer system for a nonhierarchical LCS could be a simple
time shared bus, as in the PDP-11, or a shared memory system. The latter case
can be implemented with a set of memory modules and a processor-memory
interconnection network or a multiported main memory. In a multiported
memory system, the arbitration and selection logic of the switch are distributed
into the memory modules. The MTS is one of the most important factors that
determine the performance of the multiprocessor system. For LCS configurations
that use a single time shared bus, the performance is limited by the message
arrival rate on the bus, the message length, and the bus capacity (in bits per second).
WWW.Gitmgurgaon.blogspot.com
462 COMPUTER ARCHITECTURE AND PARALLEL PROCESSING
Contentions for the bus increase as the number of computer modules increases.
For the LCS with a shared memory MTS, the limiting factor is the memory conflict problem imposed by the processor-memory interconnection network.
The communication memory may also be centralized and connected to a
time shared bus, or be part of the shared memory system. Conceptually, a dis-
tributed or centralized communication memory can be considered as consisting
of logical ports which can be accessed by the processors. Processes (tasks) can
communicate with other processes allocated to the same processor, or with tasks
allocated to other processors. Associated with each task is an input port stored in
the local memory of the processor to which the task is allocated. Every message
issued by the task is directed to the input port of the destination task. Communica-
tion between tasks allocated to the same processor takes place through the local
memory only. Communication between tasks allocated to different processors is
through a communication port residing in the communication memory. One
communication port is associated with each processor as its input port.
The logical structure of the communication between tasks is shown in Figure
7.2. A process allocated to processor P_1 puts a message into the input port of another task in P_1, as illustrated by the arrow marked a. The b arrows show a two-step action in transferring messages between processors. Arrow b_1 sends a message to the input port of processor P_2. Arrow b_2 shows the moving of the message to the input port of the destination process.
Figure 7.2 Communication between processes in a multiprocessor environment (a common memory containing communication ports serves the input ports of processors 1 and 2).
[Figure: computer modules connected by map buses and intercluster buses]
[Figure: the Slocal mapping table routes each processor reference either to the local memory or, for nonlocal addresses, onto the map bus]
Figure 7.4 Address mapping in the Slocal of the Cm*. (Courtesy of the Cm* project at Carnegie-Mellon University, 1980.)
[Figure: Kmap structure with service, return, send, run, and out queues, ports to intercluster buses 0 and 1, the Pmap, and the map bus]
Figure 7.6 The steps in an intracluster memory access. (Courtesy of the Cm* project at Carnegie-Mellon University, 1980.)
[Figure: steps in a cross-cluster memory access, as labeled in the original figure: (1) master Kmap receives request from master Cm and allocates a master context; (2) master Kmap prepares intercluster message; (3) message travels to slave Kmap, which decodes the request and allocates a slave context; (4) request for memory cycle sent to destination Cm; (5) return result sent back to slave Kmap; (6) slave Kmap prepares return intercluster message; (7) message returns to master Kmap; (8) master Kmap receives return message and frees the master context; (9) result sent to master Cm]
Figure 7.7 The steps in a cross-cluster memory access. (Courtesy of the Cm* project at Carnegie-Mellon University, 1980.)
the exception of the leaf nodes. The total number of nodes is (b^n − 1)/(b − 1), and the number of links is (b^n − b)/(b − 1). In a binary tree, each cluster is connected strictly to its two children and to a single parent. Communication between leaf nodes faces a bottleneck toward the top of the tree. Hence, such tree structures may not perform well on a large range of problems. Binary tree structures have been shown to be theoretically promising for sorting, matrix multiplication, and for solving some NP-hard problems. The basic scheme involves divide-and-conquer techniques.
width of the address within a module and k is the width of the data path. Hence the crossbar switch for a p by l multiprocessor system has a complexity O(pl(a + k)). For large p and l, the crossbar may dominate the cost of the multiprocessor system. If the crossbar switch is distributed across the memory modules, a multiported memory results. The complexity of the multiported memory is similar to that of the crossbar. Alternately, the PMIN could be a multistage network, some examples of which were discussed in Chapter 5.
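For illustration, with hypothetical values of p = l = 16, a 20-bit address within a module, and a 32-bit data path, the switch must carry roughly 16 × 16 × (20 + 32) ≈ 13,000 crosspoint signal lines, which shows how quickly the crossbar cost grows with p and l.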
A memory module can satisfy only one processor's request in a given memory cycle. Hence, if two or more processors attempt to access the same memory module, a conflict occurs which is resolved or arbitrated by the PMIN. If necessary, the PMIN may be designed to permit broadcasting of data from one processor to two or more memory modules. To avoid excessive conflicts, the number of memory modules l is usually as large as p. Another method used to reduce the degree of conflicts is to associate a reserved storage area with each processor. This is the unmapped local memory (ULM) in Figure 7.8a. It is used to store kernel code and operating system tables often used by the processes running on that processor. For example, if each processor is multiprogrammed, each time a task switch is desired the state of the process to be blocked may be saved in the ULM. The ULM helps in reducing the traffic in the PMIN and hence the degree of conflicts.
[Figure 7.8(a): a tightly coupled multiprocessor without private caches: p processors, each with an unmapped local memory (ULM), reach the shared memory modules 0, ..., l − 1 through the processor-memory interconnection network (PMIN), the input-output channels and disks through the I/O-processor interconnection network (IOPIN), and one another through the interrupt signal interconnection network (ISIN)]
[Figure 7.8(b): a tightly coupled multiprocessor with private caches: each processor has an unmapped local memory (ULM), a memory map (MM), and a private cache; DMA and buffered I/O channels and pipelined shared memory modules are reached through the PMIN, IOPIN, and ISIN]
cross points. A multiprocessor organization which uses a private cache with each processor is shown in Figure 7.8b. This multiprocessor organization encounters the cache coherence problem. More than one inconsistent copy of data may exist in the system. Various solutions to the cache coherence problem are given in Section 7.3. Examples of multiprocessors with private caches are the IBM 3084 and the S-1.
In Figure 7.8, there is a module attached to each processor that directs the memory references to either the ULM or the private cache of that processor. This module is called the memory map and is similar in operation to the Slocal discussed earlier. The general scheme for implementing memory maps was discussed in Chapter 2. The ISIN permits each processor to direct an interrupt to any other processor. Synchronization between processors is facilitated by the use of such an interprocessor network. The ISIN can also be used by a failing processor to broadcast a hardware-initiated alarm to the functioning processors. The IOPIN permits a processor to communicate with an I/O channel which is connected to peripheral devices.
The complexity of the ISIN may vary from a simple time shared bus to a
complex crossbar switch. For example, in the Univac 1100/80 and Honeywell
60/66 multiprocessor systems, a connection is established between every pair of
processors for the ISIN. The C.mmp system uses a time shared bus for inter-
processor communication. A time shared bus is much cheaper than a crossbar
switch but encounters more contentions and delays due to bus-arbitration logic.
However, the interrupt request rate to the bus is usually low enough to make the
shared bus an attractive solution to interprocessor communication.
The set of processors used in a multiprocessor system may be homogeneous
or heterogeneous. It is homogeneous if the processors are functionally identical.
For example, the multiprocessor system of the IBM 3081K has two identical
processors. Even if the processors are homogeneous, they may be asymmetric.
That is, two functionally identical components may differ along other dimensions,
such as I/O accessibility, performance or reliability. Examples with symmetric
multiprocessor configurations are the Honeywell 60/66 and the Univac 1100/80.
Examples of the asymmetric multiprocessors are the attached processor systems
such as the IBM 3084 AP and the C.mmp.
In most cases, the asymmetry or symmetry of the multiprocessor system is
usually transparent to the user processes. It is only of interest to the operating
system, especially with respect to load balancing and other scheduling considera-
tions. In general, a homogeneous system is easier to program and eliminates the
connector problem, which arises in getting two dissimilar processors to effectively
communicate. The symmetric system usually can better facilitate error recovery,
in case of failure.
[Figure: memories, processors, and I/O processors (IOPs) with I/O channels connected by a crossbar switch; in the second configuration a redundant connection keeps an IOP reachable when its processor is faulty]
Figure 7.10 Increased availability in an asymmetric I/O subsystem through redundant connections.
IOP_1 is still accessible through another processor when processor 1 fails. The availability is provided at the cost of additional arbitration logic required for the multiple paths. Also, the extra logic must be sufficiently reliable that the degradation it introduces is more than compensated for by the extra reliability of redundant paths. However, if the reliability of the extra logic is poor, then the reliability and availability of the system in Figure 7.10 will be poorer than that of the original system. The disadvantage of the fully symmetric case is the cost of the crossbar switch. This cost can be reduced without significant sacrifice in availability by using a multistage network such as the delta network discussed in the previous section, or a multiported system. Three examples of a tightly coupled multiprocessor system are the Cyber-170, the Honeywell 60/66, and the PDP-10. These examples are briefed below and details can be found in Satyanarayanan (1980).
[Figure: the Cyber-170 multiprocessor structure: central processors (CPs) and the peripheral processor subsystem (PPS) connected through the central memory controller (CMC) to the central memory (CM) and extended core memory (ECM)]
intelligent switch to route interrupts and other communications among the various
system components. When more than one element attempts to access the same
memory module, the corresponding system controller resolves the conflict. This
triple redundancy organization is particularly designed to enhance availability and
fault tolerance.
The PDP-10 multiprocessor Figure 7.13 shows two configurations of the PDP-10
multiprocessor with multiported memory modules. Each CPU has a cache of 2K
words where each word is 36 bits. Figure 7.13a illustrates the asymmetric master-
slave configuration. The two processors are identical, but the asymmetry is a
result of the connection of the peripherals to the master only. Hence, the slave
cannot initiate peripheral operations nor respond to an interrupt directly. The
symmetrical configuration of the PDP-10 multiprocessor is shown in Figure 7.13b.
Both processors are connected to a set of shared fast and slow peripherals. How-
ever, each data channel is attached to one processor, which is the only processor
that can use it. Note that slow peripherals are connected to both processors via
a switch. There is no cache invalidate interface between them. It is assumed that a
software solution is used to enforce cache consistency.
The three tightly coupled multiprocessors discussed above are just a few of the
commercial systems available. There is a trend to achieve improved performance
[Figure 7.13: PDP-10 multiprocessor configurations with up to 16 multiported memory modules: (a) master/slave configuration, with peripherals attached to the master processor only; (b) symmetric configuration, with shared slow peripherals reached through a switch; DC_i: the ith data channel attached to the ith processor]
shown in Figure 7.14. The current process register points to the register set
currently in use. For example, each processor in the S-1 multiprocessor system
has 16 sets of registers. Stack instructions which rapidly save and restore the
processor status word tend to minimize switching overheads. The implementation
of reentrant procedure calls is related to the stack manipulative structure of the
processor.
Large virtual and physical address spaces A processor intended to be used in the
construction of a general-purpose medium to large-scale multiprocessor must
support a large physical address space. Even when an algorithm is decomposed
so that it can be implemented using very small amounts of code, processes some-
times need to access large amounts of data objects. The 16-bit address space of the
processor used in the C.mmp hampered effective programming of the system.
In addition to the need for a large physical address space, a large virtual address space is also desirable. If possible, the virtual address space should be segmented to promote modular sharing and the checking of address bounds for memory protection and software reliability. For example, the processor used in the S-1 multiprocessor system has 2 gigabytes of virtual memory and 4 gigawords of physical memory, where each word is 36 bits wide.
[Figure: a processor with multiple register sets; the current process register selects the set in use, and a context switch changes the selection]
Figure 7.14 Context switching in a processor with multiple register sets.
Instruction set The instruction set of the processor should have adequate facilities for implementing high-level languages that permit effective concurrency at the procedure level and for efficiently manipulating data structures. Instructions should be provided for procedure linkage, looping constructs, parameter manipulation, multidimensional index computation, and range checking of addresses.
Furthermore, the instruction set should also include instructions for creating and
[Figure: a time-shared single-bus multiprocessor organization: processors, an I/O processor, memory units, and I/O devices attached to one common path through the bus modifier and control logic]
increases the bus contention, which degrades system throughput and increases
arbitration logic. The total overall transfer rate within the system is limited by the
bandwidth and speed of this single path. For this reason, private memories and
private I/Os are highly advantageous. Interconnection techniques that overcome
these weaknesses add to the complexity of the system.
An extension of the single path organization to two unidirectional paths, as shown in Figure 7.16, alleviates some of the problems mentioned above without an appreciable increase in system complexity or decrease in reliability. However, a single transfer operation in such a system usually requires the use of both buses, hence not much is actually gained.
The next step in alleviating the limitations of the time shared bus is to provide multiple bidirectional buses, as shown in Figure 7.17, to permit multiple simultaneous bus transfers; however, this increases the system complexity significantly. In this case, the interconnection subsystem becomes an active device. A number of computer systems, such as the Tandem-16 and Pluribus, employ variations of the time shared system of buses discussed above. In general, the above organizations are usually appropriate for small systems.
In view of the increasing numbers and speeds of devices attached to a central
bus as a result of changing technology and applications, the bus can become heavily
loaded. Therefore, the bus impairs the performance of the devices and, thus, of the
overall system. There are several factors that affect the characteristics and per-
formance of a bus. These include the number of active devices on the bus, the
bus-arbitration algorithm, centralization (or distribution) of control, data width,
synchronization of data transmission, and error detection. We will examine several
(A) The static priority algorithm Many digital buses used today assign unique static priorities to the requesting devices. When multiple devices concurrently request the use of the bus, the device with the highest priority is granted access to it. This approach is usually implemented using a scheme called daisy chaining, in which all devices are effectively assigned static priorities according to their locations along a bus grant control line. The device closest to a central bus controller is assigned the highest priority (Figure 7.18). Requests are made on a common request line, BRQ. The central bus control unit propagates a bus grant signal (BGT) if the acknowledge signal (SACK) indicates that the bus is idle.
The first device which has issued a bus request that receives the BGT signal stops the latter's propagation. This sets the bus-busy flag in the controller and the device assumes bus control. On completion, it resets the bus-busy flag in the controller and a new BGT signal is generated if other requests are outstanding. The DEC PDP-11 Unibus uses this approach. The Motorola MC68000 processor incorporates such a bus control unit.
Another DEC bus called the synchronous backplane interconnect (SBI) and used in the VAX 11/780 computer implements static priorities using a distributed scheme called parallel priority resolution, in which the time required to determine which requesting device has the highest priority is fixed (unlike daisy chaining). Using static priorities clearly gives preferences and, thus, lower wait times to devices with higher priorities.
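Functionally, daisy chaining is a priority encoder over device positions. The sketch below is an illustration only (not the Unibus or MC68000 logic): it grants the bus to the requesting device closest to the controller.

def daisy_chain_grant(requests):
    """requests[i] is True if device i asserts BRQ; device 0 is closest to
    the central controller. Returns the granted device, or None if idle."""
    for i, brq in enumerate(requests):
        if brq:                    # the BGT signal stops at the first requester
            return i
    return None

print(daisy_chain_grant([False, True, False, True]))   # -> 1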
(B) The fixed time slice algorithm Another common bus-arbitration algorithm
divides the available bus bandwidth into fixed-length time slices that are then
sequentially offered to each device in a round-robin fashion. Should the selected
device elect not to use its time slice, the time slice remains unused by any device.
This technique, called fixed time slicing (FTS) or time division multiplexing (TDM),
is used by Digital’s Parallel Communications Link, which also allows a flexible
assignment of available time slices to the devices. This scheme is usually used with
synchronous buses in which all devices are synchronized to a common clock.
The service given to each device in the FTS scheme for access to the bus is
independent of that device’s position or identity on the bus; schemes with this
characteristic are said to be symmetric. In particular, all n devices are given one out of every n time slices at fixed intervals in this scheme. Symmetric bus-arbitra-
tion algorithms optimally load-balance all bus requests because no preference is
given to any device. It further delivers a bounded maximum wait time to the
devices. However, it suffers a high average wait time (and, thus, a lower bus
utilization).
When the bus is not heavily loaded, FTS incurs substantially higher wait times than does the static priority scheme, although the variability of service is lower and remains constant regardless of the bus load.
Both algorithms offer good performance under light bus loading; these charac-
teristics and their relative simplicity explain their widespread popularity.
(C) Dynamic priority algorithms The following dynamic priority algorithms allow
the load-balancing characteristics of symmetric algorithms such as fixed time
slicing to be achieved without incurring the penalty of high wait times. The devices
are assigned unique priorities and compete to access the bus, but the priorities are
dynamically changed to give every device an opportunity to access the bus. If the
algorithm used to permute the priorities favors no individual device (is symmetric),
then the system load balances the bus requests. Further, using priorities overcomes
the inefficiency inherent in the fixed time slice scheme of allocating full time slices
to the devices before requests are placed. Two algorithms for dynamically per-
muting priorities are the least recently used (LRU) and the rotating daisy chain
(RDC).
The LRU algorithm gives the highest priority to the requesting device that
has not used the bus for the longest interval. This is accomplished by reassigning
priorities after each bus cycle. The second dynamic priority algorithm generalizes
the daisy chain implementation of static priorities. Recall that in the daisy chain
scheme all devices are given static and unique priorities according to their positions on a bus-grant line emanating from a central controller.
In the RDC scheme, no central controller exists, and the bus-grant line is
connected from the last device back to the first in a closed loop (Figure 7.19).
Whichever device is granted access to the bus serves as bus controller for the
following arbitration (an arbitrary device is selected to have initial access to the
bus). Each device’s priority for a given arbitration is determined by that device's
distance along the bus-grant line from the device currently serving as bus controller;
the latter device has the lowest priority. Hence, the priorities change dynamically
with each bus cycle.
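The rotating daisy chain can be modeled by measuring each device's distance along the closed grant loop from the device that last used the bus. A sketch (illustrative only):

def rdc_grant(requests, last_master):
    """Grant the bus to the requesting device nearest, along the closed
    bus-grant loop, to the device that used the bus last; that device
    itself now holds the lowest priority."""
    n = len(requests)
    for offset in range(1, n + 1):
        i = (last_master + offset) % n
        if requests[i]:
            return i
    return None

# Device 2 used the bus last, so device 3 now outranks devices 0 and 1.
print(rdc_grant([True, True, False, True], last_master=2))   # -> 3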
[Figure: the bus-grant (BGT) line forms a closed loop through all devices, each of which also drives the common BRQ line; a partial bus controller function resides in whichever device currently holds the bus]
Figure 7.19 Rotating daisy chain implementation of a system bus.
[Figure: polling implementation of a system bus, with a bus control unit driving poll-count lines and monitoring the SACK and BRQ lines]
and bus grant (BGT_i) lines are connected to each device i sharing the bus, as shown in Figure 7.21. This requesting technique can permit the implementation of LRU, FCFS, and a variety of other allocation algorithms.
[Figure: each device i has its own bus request (BRQ_i) and bus grant (BGT_i) lines to the bus control unit, together with a common SACK line]
Figure 7.21 Independent request implementation of a system bus.
Figure 7.22 Crossbar (nonblocking) switch system organization for multiprocessors. (Courtesy of ACM Computing Surveys, Enslow, 1977.)
The crossbar switch possesses complete connectivity with respect to the memory
modules because there is a separate bus associated with each memory module.
Therefore, the maximum number of transfers that can take place simultaneously is
limited by the number of memory modules and the bandwidth-speed product of
the buses rather than by the number of paths available.
The important characteristics of a system utilizing a crossbar interconnection
matrix are the extreme simplicity of the switch-to-functional unit interfaces and
the ability to support simultaneous transfers for all memory units. To provide
these features requires major hardware capabilities in the switch. Not only must
each cross point be capable of switching parallel transmissions, but it must also
be capable of resolving multiple requests for access to the same memory module
occurring during a single memory cycle. These conflicting requests are usually
handled on a predetermined priority basis. The result of the inclusion of such a
capability is that the hardware required to implement the switch can become quite
large and complex. Although very large-scale integration (VLSI) can reduce the size of the switch, it will have little effect on its complexity.
In a crossbar switch or multiported device, conflicts occur when two or more
concurrent requests are made to the same destination device. In the following
discussion, we assume that there are 16 destination devices (memory modules)
and 16 requestors (processors). The implementation to be described can also be
used for a processor to device connection. Figure 7.23 shows an example functional design of a crossbar switch element or multiported memory for one module. The switch consists of arbitration and multiplexer modules. Each processor generates a memory module request signal (REQ) to the arbitration unit, which selects the
[Figure 7.23: one crossbar switch element (or one port of a multiported memory): an arbitration module receives REQ_0 ... REQ_15 from the processors and returns ACK to the selected one, while a multiplexer module passes data, word address, and RD/WR control from P_0 through P_15 to the memory module]
processor with the highest priority. The selection is accomplished with a priority
encoder. The arbitration module returns an acknowledge signal (ACK) to the
selected processor. After the processor receives the ACK, it initiates its memory
operation.
The multiplexer module multiplexes data, addresses of words within the module, and control signals from the processor to the memory module using a 16-to-1 multiplexer. The multiplexer is controlled by the encoded number of the selected processor. This code was generated by the priority encoder within the arbitration module.
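Behaviorally, each switch element of Figure 7.23 is a priority encoder followed by a 16-to-1 multiplexer. The sketch below is illustrative only, not the C.mmp hardware.

def memory_module_cycle(requests, payloads):
    """requests[i] is True if processor i raises REQ to this memory module;
    payloads[i] is its (address, data, read/write) bundle. Returns the
    acknowledged processor number and the payload passed to the module."""
    for i, req in enumerate(requests):    # fixed-priority encoder: lowest index wins
        if req:
            return i, payloads[i]         # ACK_i; the multiplexer selects input i
    return None, None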
Such a scheme was used to implement the processor-memory switch for the C.mmp, which has 16 processors and 16 memory modules. The switch consists of 16 sets of cross points from one processor port to the 16 memory ports, and another 16 sets of cross points from one memory port to the 16 processor ports. Theoretically, expansion of the system is limited only by the size of the switch matrix, which can often be modularly expanded within initial design or other engineering limitations. One effect of VLSI on the crossbar interconnection system is the feasibility of designing crossbar matrices for a larger capacity than initially required and equipping them only for the present requirements. Expansion would then be facilitated, since all that is required is the addition of the missing cross points.
In order to provide the flexibility required in access to the input-output
devices, a natural extension of the crossbar switch concept is to use a similar switch
on the device side of the I/O processor or channel, as shown in Figure 7.24. The
WWW.Gitmgurgaon.blogspot.com
[Figure 7.24: crossbar switches on both the memory side and the device side of the I/O channels, connecting memories and I/O devices]
hardware required for the implementation is quite different and not nearly so
complex because controllers and devices are normally designed to recognize their
own unique addresses. The effect is the same as if there were a primary bus
associated with each I/O channel and crossbuses for each controller or device.
The crossbar switch has the potential for the highest bandwidth and system
efficiency. However, because of its complexity and cost, it may not be cost-effective
for a large multiprocessor system. The reliability of the switch is problematic;
however, it can be improved by segmentation and redundancy within the switch.
In general, it is normally quite easy to partition the system to logically isolate
malfunctioning units. There are a number of examples of systems utilizing the cross-
bar interconnection systems. Some of these are the C.mmp and the S-1 multi-
processor systems, which are to be discussed in Chapter 9.
If the control, switching, and priority arbitration logic that is distributed
throughout the crossbar switch matrix is distributed at the interfaces to the memory
modules, a multiport memory system is the result, as the example shows in Figure
7.25. This system organization is well suited to both uni- and multiprocessor
system organizations and is used in both. The method often utilized to resolve
memory-access conflicts is to assign permanently designated priorities at each
memory port. The system can then be configured as necessary at each installation
to provide the appropriate priority access to various memory modules for each
functional unit, as shown in Figure 7.26. Except for the priority associated with
each, all of the ports are usually electrically and operationally identical. In fact, the ports are often merely a row of identical cable connectors, and electrically it makes no difference whether an I/O processor or central processor is attached.
The flexibility possible in configuring the system also makes it possible to
designate portions of memory as private to certain processors, I/O units, or combinations of these.
Figure 7.25 Multiport memory organization without fixed priority assignment. (Courtesy of ACM Computing Surveys, Enslow, 1977.)
Figure 7.26 Multiport-memory system with assignment of port priorities. (Courtesy of ACM Computing Surveys, Enslow, 1977.)
Figure 7.27 Multiport memory organization with private memories. (Courtesy of ACM Computing Surveys, Enslow, 1977.)
failure.
4. Expanding the system by the addition of functional units may degrade overall system performance (throughput).
5. The system efficiency attainable is the lowest of all three basic interconnection systems.
6. This organization is usually appropriate for smaller systems only.

switch.
3. Because a basic switching matrix is required to assemble any functional units into a working configuration, this organization is usually cost-effective for multiprocessors only.
4. Systems expansion (addition of functional units) usually improves overall performance. There is the highest potential for system efficiency, such as for system expansion without reprogramming of the operating system.
5. Theoretically, expansion of the system is limited only by the size of the switch matrix, which can often be modularly expanded within initial design or other engineering limitations.
6. The reliability of the switch, and therefore the system, can be improved by segmentation and/or redundancy within the switch.

Multiprocessors with multiport memory:
1. Requires the most expensive memory units since most of the control and switching circuitry is included in the memory unit.
2. The characteristics of the functional units permit a relatively low cost uniprocessor to be assembled from them.
3. There is a potential for a very high total transfer rate in the overall system.
4. The size and configuration options possible are determined (limited) by the number and type of memory ports available; this design decision is made quite early in the overall design process and is difficult to modify.
5. A large number of cables and connectors are required.
of connecting the input A to either the output labeled 0 or the output labeled 1, depending on the value of some control bit c_A of the input A. If c_A = 0, the input is connected to the upper output, and if c_A = 1, the connection is made to the lower output. Terminal B of the switch behaves similarly with a control bit c_B. The 2 × 2 module also has the capability to arbitrate between conflicting requests. If both inputs A and B require the same output terminal, then only one of them will be connected and the other will be blocked or rejected.

The 2 × 2 switch shown in Figure 7.28 is not buffered. In such a switch, the performance may be limited by the switch setup time, which is experienced each time a rejected request is resubmitted. To improve the performance, buffers can be inserted within the switch, as shown in Figure 7.29. Such a switch has also been shown to be effective for packet switching when used in a multistage network.
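As a concrete illustration of this routing and arbitration behavior, the short sketch below models an unbuffered 2 × 2 switch in Python. The function name route_2x2 and the fixed priority given to input A are illustrative assumptions, not details taken from the figures.

    # Minimal model of an unbuffered 2 x 2 switch (illustrative sketch).
    # A request is a control bit: 0 -> upper output, 1 -> lower output.
    # On a conflict, input A wins (an assumed fixed priority) and B is blocked.

    def route_2x2(req_a=None, req_b=None):
        """Return (outputs, blocked): outputs maps output port -> winning input."""
        outputs = {}                      # 0 = upper output, 1 = lower output
        blocked = []
        if req_a is not None:
            outputs[req_a] = 'A'
        if req_b is not None:
            if req_b in outputs:          # both inputs want the same output terminal
                blocked.append('B')       # B is rejected and must resubmit later
            else:
                outputs[req_b] = 'B'
        return outputs, blocked

    print(route_2x2(req_a=0, req_b=0))    # ({0: 'A'}, ['B'])  conflict on upper output
    print(route_2x2(req_a=0, req_b=1))    # ({0: 'A', 1: 'B'}, [])  both connected

A buffered switch, as in Figure 7.29, would queue the rejected request inside the module instead of reporting it back to the source.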
Figure 7.28 A 2 × 2 crossbar switch (connections for control bit of A equal to 0 and to 1).

Figure 7.30 A 1-by-8 demultiplexer implemented with 2 × 2 switch boxes.
The spread s of a node is the number of arcs fanning into it. An (f, s, l) banyan network can thus be described as a partially ordered graph with l levels in which there is exactly one path from every base to every apex node. The fanout of each nonbase node is f and the spread of each nonapex node is s. Each node of the graph is an s × f crossbar switch.

The banyan network can be derived from a uniform tree with fanout f. We illustrate the derivation of a (2, 2, 2) banyan network from the two-level binary tree shown in Figure 7.31a. Since the spread is 2, two arcs should be fanning into each nonroot node. Therefore, we replicate the root to obtain s copies and attach each copy to the next-level nodes, as shown in Figure 7.31b. To make the spread of the leaf nodes equal to two, replicate the top two levels (interleaving the second-level nodes). Join the second-level nodes to the leaf nodes to make the fanout of the second-level nodes equal to 2 and the spread of the leaf nodes equal to 2. This completes the derivation of the (2, 2, 2) banyan network, which is shown in Figure 7.31c.

A banyan network has the advantage of providing a complete interconnection of one set of n devices to another set of n devices at a cost in switching circuitry that grows as n log n. A crossbar switch, by contrast, grows as n^2. In general, an (f, s, l) banyan network can be defined as l recursions on an s × f crossbar switch. A crosspoint is thus a banyan of height 1. Hence different topologies of banyan networks can be implemented for multiprocessors. However, more studies need to be done.
Figure 7.31 Derivation of a (2, 2, 2) banyan graph: (a) a tree with fanout 2 (2 levels); (b) replication of the root; (c) replication of the next level, spread 2.
The q-shuffle of qc objects, S_{q,c}(i), is defined as

    S_{q,c}(i) = qi mod (qc - 1)    for 0 <= i < qc - 1
               = i                  for i = qc - 1                         (7.2)
If the destination is expressed in base b as d_{n-1} d_{n-2} ... d_1 d_0, where 0 <= d_i < b, then the base-b digit d_i controls the crossbar modules of stage (n - i). The a-shuffle function is used to convert the outputs of a stage to the inputs of the next stage, where the inputs and outputs are numbered 0, 1, 2, ..., starting at the
top. Figure 7.33 shows a general a^n × b^n delta network, which has a^n sources and b^n destinations. Numbering the stages of the network as 1, 2, ..., n, starting at the source side of the network, requires that there be a^{n-1} crossbar modules in the first stage. The first stage then has a^{n-1}·b output terminals. This implies that stage two must have a^{n-1}·b input terminals, which requires a^{n-2}·b crossbar modules in the second stage. In general, the ith stage has a^{n-i}·b^{i-1} crossbar modules of size a × b. Thus, the total number of a × b crossbar modules required in an a^n × b^n delta network can be found as (a^n - b^n)/(a - b) for a ≠ b, and n·b^{n-1} for a = b. Two delta networks, one 4^2 × 3^2 and the other 2^3 × 2^3, derived from Figure 7.33 are shown in Figures 7.34 and 7.35, respectively; the interstage link patterns are a 4-shuffle and a 2-shuffle, respectively. Note that the destinations in Figures 7.34 and 7.35 are labeled in bases 3 and 2, respectively. It has been shown that the a-shuffle link pattern used between adjacent stages allows a source to connect to any destination by using the destination-digit control of each a × b crossbar module.
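The module count and the interstage shuffle are easy to check numerically. The sketch below (hypothetical helper names) evaluates the stage-by-stage module count given above and the q-shuffle of Eq. 7.2.

    # Module count for an a^n x b^n delta network and the q-shuffle of Eq. 7.2 (sketch).

    def delta_module_count(a, b, n):
        """Sum over stages i = 1..n of a**(n-i) * b**(i-1) crossbar modules of size a x b."""
        return sum(a ** (n - i) * b ** (i - 1) for i in range(1, n + 1))

    def q_shuffle(q, c):
        """q-shuffle of q*c objects: i -> q*i mod (q*c - 1), with the last object fixed."""
        size = q * c
        return [(q * i) % (size - 1) if i < size - 1 else i for i in range(size)]

    print(delta_module_count(2, 2, 3))   # 12 modules of size 2 x 2 (the network of Fig. 7.35)
    print(delta_module_count(4, 3, 2))   # 7 modules of size 4 x 3 (the network of Fig. 7.34)
    print(q_shuffle(2, 4))               # 2-shuffle of 8 links: [0, 2, 4, 6, 1, 3, 5, 7]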
Figure 7.33 An a^n × b^n delta network. (Courtesy of IEEE Trans. Computers, Patel 1981.)
Figure 7.34 A 4^2 × 3^2 delta network. (Courtesy of IEEE Trans. Computers, Patel 1981.)
Note that the network of Figure 7.35 does not allow an identity permutation, which is useful if, say, memory module i is a "favorite" or home module of processor i. An identity permutation would then allow most of the memory references to be made without conflict. A simple renaming of the inputs of Figure 7.35 will permit an identity permutation. This is shown in Figure 7.36. In this case, if all 2 × 2 switches are in the straight position, an identity permutation is generated.

Figure 7.35 A 2^3 × 2^3 delta network. (Courtesy of IEEE Trans. Computers, Patel 1981.)

In general, b is a power of 2 and a is very small, usually between 1 and 4. Figure 7.37 illustrates the functional block diagram of a 2 × 2 crossbar module. All single lines in the figure are 1-bit lines. The double lines on the INFO box represent address lines, incoming and outgoing data lines, and a read-write control line. The data lines may or may not be bidirectional. The function of the INFO box
is that of a simple 2 × 2 crossbar: if the input X is 1, then a cross connection exists, and if X is 0, then a straight connection exists.

The function of the control box is to generate the signal X and provide arbitration. A request exists at an input port if the corresponding request line is 1. The destination digit provides the nature of the request: a 0 for the connection to the upper output port and a 1 for the lower port. In case of conflict, the request r_0 is given the priority and a busy signal b_1 = 1 is supplied to the lower input port. A busy signal is eventually transmitted to the source which originated the blocked request. The logic equations for all the labeled signals are given with the block diagram. For the INFO box, the equations are given for the left-to-right direction. The parallel generation of X and its complement saves one gate level.
The operation of a 2^n × 2^n delta network using the above-described 2 × 2 modules is as follows. Recall that there are n stages in this network. All processors requiring memory access must submit their requests at the same time by placing a
1 on the respective request lines. If the busy line is 1, then the processor must resubmit its request. This can be accomplished simply by doing nothing; i.e., it continues to hold the request line high. Thus the operation of the implementation described here is synchronous; that is, the requests are issued at fixed intervals at the same time. An asynchronous implementation is preferable if the network has many stages. However, such an implementation would require storage buffers for addresses, data, and control in every module and also a complex control module. Thus, the cost of such an implementation might well be excessive.
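The synchronous cycle just described can be simulated directly. The sketch below routes one cycle of requests through a 2^n × 2^n network of 2 × 2 modules using destination bits, most significant digit first. It uses the omega variant of the delta family (a perfect shuffle ahead of every stage) and gives the upper input priority on a conflict; both choices are assumptions of this sketch rather than details fixed by the text.

    # One synchronous cycle of an omega-style 2^n x 2^n delta network (sketch).
    # Blocked requests would simply be resubmitted in the next cycle.

    def shuffle(lines, n_bits):
        """Perfect 2-shuffle of 2**n_bits line positions: position i -> rotate-left(i)."""
        size = 1 << n_bits
        out = [None] * size
        for pos, req in enumerate(lines):
            if req is not None:
                out[((pos << 1) | (pos >> (n_bits - 1))) & (size - 1)] = req
        return out

    def route_cycle(requests, n_bits):
        """requests: dict source -> destination.  Returns (accepted, blocked) sources."""
        size = 1 << n_bits
        lines = [None] * size
        for src, dst in requests.items():
            lines[src] = (src, dst)
        blocked = []
        for stage in range(n_bits):
            lines = shuffle(lines, n_bits)
            bit = n_bits - 1 - stage              # destination bit used at this stage
            new_lines = [None] * size
            for sw in range(size // 2):           # switch sw handles lines 2*sw, 2*sw+1
                for line in (2 * sw, 2 * sw + 1):
                    req = lines[line]
                    if req is None:
                        continue
                    out = 2 * sw + ((req[1] >> bit) & 1)
                    if new_lines[out] is None:
                        new_lines[out] = req      # connection established
                    else:
                        blocked.append(req[0])    # conflict: this request is rejected
            lines = new_lines
        accepted = [req[0] for req in lines if req is not None]
        return accepted, blocked

    # Processors 0 and 1 both request memory 5; one of them is blocked.
    print(route_cycle({0: 5, 1: 5, 2: 0, 3: 7}, n_bits=3))   # ([2, 0, 3], [1])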
Figure 7.37 Functional block diagram of a 2 × 2 crossbar module, with its CONTROL and INFO boxes.
2. At the beginning of every cycle, each processor generates a new request with a
probability r. Thus, r is also the average number of requests generated per cycle
by each processor.
3. The requests which are blocked (not accepted) are ignored; that is, the requests
issued at the next cycle are independent of the requests blocked.
The last assumption is there to simplify the analysis. Although the model does not properly account for rejections, it still serves a useful purpose. It can be solved exactly and it gives a lower bound on the expected bandwidth. In practice, of course, the rejected requests must be resubmitted during the next cycle or buffered in the module where the conflict occurs; thus the independent-request assumption will not hold. Later, the last assumption will be relaxed to improve
the model. Moreover, simulation studies performed by many authors for similar problems have shown that the probability of acceptance is only slightly lowered if the third assumption above is omitted. Thus the results of the analysis are fairly reliable and they provide a good measure for comparing different networks.
The probability that a given memory module is not requested by any of the p processors during a cycle is (1 - r/m)^p. Thus the expected bandwidth B(p, m), that is, the average number of requests accepted per cycle, is

    B(p, m) = m[1 - (1 - r/m)^p]                                           (7.4)
Let us define the ratio of the expected bandwidth to the expected number of requests generated per cycle as the probability of acceptance P_A. P_A is the probability that an arbitrary request will be accepted. Therefore

    P_A = B(p, m)/(rp) = (m/rp)[1 - (1 - r/m)^p]                           (7.5)
Figure 7.38 Markov graph for computing the dynamic request rate r′.

The request rate r should be defined more precisely as the rate assuming conflict-free accesses. We refer to r as the static request rate. However, memory requests are also made during each wasted cycle. Therefore, the memory modules encounter a dynamic request rate r′ which is actually higher than the static request rate because of memory conflicts. r′ can be obtained from the Markov graph of Figure 7.38 as
    r′ = r / [P_A + r(1 - P_A)]                                            (7.9)

Therefore Eq. 7.5 becomes

    P_A = (m/(r′p))[1 - (1 - r′/m)^p]                                      (7.10)

Equations 7.9 and 7.10 define an iterative process by which we can compute P_A for a given m, p, and r; r′ can be initialized to r for the iterative process.
Thus, P_A is a measure of the wasted cycles of blocked requests. A higher P_A indicates a lower number of wasted cycles, and a lower P_A indicates a higher number of wasted cycles. The average number of wasted cycles w per request can be computed if we note that a request that is rejected i times consecutively before it is accepted waits for i cycles:

    w = Σ(i = 1 to ∞) i(1 - P_A)^i P_A = (1 - P_A)/P_A                     (7.11)
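A small numerical sketch of this refinement follows; the function names are illustrative. It iterates Eqs. 7.9 and 7.10 starting from r′ = r and then reports the wasted-cycle count of Eq. 7.11.

    # Iterative estimate of P_A for a p x m crossbar with resubmitted requests.

    def p_accept(p, m, rate):
        """Eq. 7.5 / 7.10: P_A = (m / (rate * p)) * (1 - (1 - rate/m)**p)."""
        return (m / (rate * p)) * (1.0 - (1.0 - rate / m) ** p)

    def iterate_pa(p, m, r, iters=50):
        r_dyn = r                               # dynamic rate r' starts at the static rate r
        for _ in range(iters):
            pa = p_accept(p, m, r_dyn)          # Eq. 7.10 with the current r'
            r_dyn = r / (pa + r * (1.0 - pa))   # Eq. 7.9 from the Markov graph
        return pa, r_dyn

    pa, r_dyn = iterate_pa(p=16, m=16, r=0.5)
    w = (1.0 - pa) / pa                         # Eq. 7.11: average wasted cycles per request
    print(round(pa, 3), round(r_dyn, 3), round(w, 3))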
Analysis of delta networks Assume a delta network of size a^n × b^n constructed from a × b crossbar modules. Thus, there are a^n processors connected to b^n memory modules. We apply the result of Eq. 7.4 for a p × m crossbar to an a × b crossbar and then extend the analysis for the complete delta network. However, to apply Eq. 7.4 to any a × b crossbar module, we must first satisfy the assumptions of the analysis. We show below that the independent request assumption holds for every a × b module in a delta network.

Each stage of the delta network is controlled by a distinct destination digit (in base b) for the setting of individual a × b switches. Since the destinations are independent and uniformly distributed, so are the destination digits. Thus, for example, in some arbitrary stage i, an a × b crossbar uses digit d_{n-i} of each request; this digit is not used by any other stage in the network. Moreover, no digit other than d_{n-i} is used by stage i. Therefore, the requests at any a × b module are independent and uniformly distributed over b different destinations. Thus we can apply the result of Eq. 7.4 to any a × b module in the delta network.
Given the request rate r at each of the a inputs of an a × b crossbar module, the expected number of requests that it passes per time unit is obtained by setting p = a and m = b in Eq. 7.4, which gives

    b[1 - (1 - r/a)^a]                                                     (7.12)

Thus for any stage of a delta network, the output rate of requests per line, r_out, is a function of its input rate, r_in, and is given by

    r_out = 1 - (1 - r_in/a)^a                                             (7.13)

Applying this relation stage by stage, the expected bandwidth of an a^n × b^n delta network, that is, the expected number of requests accepted per cycle, is

    BW = b^n · r_n                                                         (7.14)

where r_i = 1 - (1 - r_{i-1}/a)^a for 1 <= i <= n and r_0 = r. The corresponding probability of acceptance is

    P_A = BW/(a^n · r)                                                     (7.15)
Since we do not have a closed-form solution for the bandwidth of delta networks (Eq. 7.14), we cannot directly compare the bandwidths of crossbar (Eq. 7.4) and delta networks. However, we present plots that compare the performance of crossbar and delta networks using Eqs. 7.5 and 7.15. Figure 7.39 shows the probability of acceptance, P_A, for 2^n × 2^n and 4^n × 4^n delta networks and a p × p crossbar, when the request rate for each processor is r = 1. The curve marked delta-2 is for delta networks using 2 × 2 switches and delta-4 for delta networks using 4 × 4 switches. Notice that P_A for the crossbar approaches a constant value, as was predicted by Eq. 7.7. P_A for delta networks continues to fall as p grows. The model refinement developed for the crossbar switch can also be applied to delta networks iteratively and is left as an exercise for the reader.
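A few lines of code reproduce the flavor of that comparison for square networks with r = 1; the helper names are illustrative.

    # P_A for a p x p crossbar (Eq. 7.5) and for an a^n x a^n delta network
    # built from a x a switches (Eqs. 7.13-7.15), with request rate r = 1.

    def pa_crossbar(p, m, r=1.0):
        return (m / (r * p)) * (1.0 - (1.0 - r / m) ** p)

    def pa_delta(a, n, r=1.0):
        rate = r
        for _ in range(n):                       # Eq. 7.13 applied stage by stage
            rate = 1.0 - (1.0 - rate / a) ** a
        bandwidth = (a ** n) * rate              # Eq. 7.14 with b = a
        return bandwidth / (a ** n * r)          # Eq. 7.15

    for n in range(1, 7):                        # p = 2, 4, ..., 64 processors
        p = 2 ** n
        print(p, round(pa_crossbar(p, p), 3), round(pa_delta(2, n), 3))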
Figure 7.39 Probability of acceptance of p × p networks. (Courtesy of IEEE Trans. Computers, Patel 1981.)
This section addresses techniques for designing parallel memories for loosely and
tightly coupled multiprocessors. The interleaving method presented is an extension
of techniques applied to memory configurations for pipeline and vector processors.
Many commercial multiprocessor systems are tightly coupled, where each pro-
cessor has a private cache. The presence of multiple private caches introduces
the problem of cache coherence or multicache consistency. Various solutions to
cache-coherence problems are presented. Finally, we describe some simple models
to evaluate the effectiveness of the various memory configurations.
module M_i is called the home memory for processor i. If the entire set of active pages of a process being executed on processor i is contained in memory M_i, and if memory M_i contains no pages belonging to processes running on other processors, then processor i encounters no memory conflicts.

If every processor has the entire set of active pages of those processes that are running on it in its home memory, there will be no memory conflicts. The concept of home memory can be extended so that a set of modules {M_i} are assigned as the home memories of processor i. This assumes that there are more memory modules than processors, so that at all times each memory module is associated with one processor. That is, {M_i} ∩ {M_j} = ∅, for i ≠ j. The home-memory organization for multiprocessors has an additional architectural advantage beyond the reduction in memory interference.
The processor-memory interconnection network (PMIN) of a multiprocessor
system may be expensive, slow, and complicated. Figure 7.40 is an alternative
organization in which each memory has two ports, one of which connects to the
PMIN and one of which connects directly to the home processor. This topology
Figure 7.40 Home memory concept. (Courtesy of IEEE Trans. Computers, Smith 1978.)
(Figure: the two-dimensional L-M main-memory organization, in which processors with private caches are connected through a crossbar switch to line controllers (LC) and main-memory (MM) modules.)
organization. For a degenerate case in which there is one module per line, the memory conflict problem arises when two or more simultaneous memory requests reference the same module, hence the same line. For the two-dimensional memory, a conflict may also occur when a memory request references a busy line or a busy module on a line. A memory configuration characterized by (l, m) is a particular realization of the L-M memory organization.
Let us assume that a write-back, write-allocate cache replacement policy is adopted in the following discussion. Let w_b be the probability that the block frame to be replaced has been modified. If a cache block frame which has not been modified is to be replaced, it is simply overwritten with the new block of data. However, a modified block frame that is to be replaced must be written to main memory (MM) before a block-read from MM is initiated. In this case, two consecutive transfers are made between the cache and MM. Hence, we assume that each time a cache miss occurs, a block-write to MM is required with probability w_b, followed by a block-read from MM.
One method of organizing the cache for block-reads and -writes is to assume
that the two consecutive block transfers (one block-write followed by one block-
read) are made between a processor and the same line. This assumption will be
satisfied if a set-associative cache is used in which all the blocks that map to the same set are stored on the same line. This assumption implies that the number of sets is a multiple of the number of lines. Hence, in this method a cache miss requires the transfer of a 2b-word block with probability w_b and the transfer of a b-word block with probability 1 - w_b.
The L-M memory organization is very cost-effective in matching the bandwidth of a cache memory which initiates block-transfer operations as a result of cache misses. The information is distributed in memory so that each block of a program resides on a line of memory. Consecutive words of a block are stored in consecutive modules on the same line. In this case, a line controller (LC) is associated with each line. The controller typically receives a cache request for a block transfer of size b and thereby issues b internal requests (IRs) to consecutive modules on the line. The blocks in the memory are interleaved on the lines so that block i is assigned to modules on line i mod l.

When the main memory is used in the block-transfer mode, the address hold time or cycle of the memory module can be chosen to be equal to the cache cycle time in order to effectively utilize the line. In practice, the address cycle can be made as small as the cache cycle time by incorporating an address latch in each memory module. Let the cache cycle time be the unit time. Therefore, the memory cycle can be expressed as c time units. Also, the modules on a line are interleaved in a particular fashion so that the servicing of two memory requests can be overlapped on the same line. The modules on a line are interleaved so that a block of data of size b (a power of 2) is interleaved on consecutive modules on that line. Let line i and module j on that line be referred to as L_i and M_{i,j}, respectively, for 0 <= i <= l - 1 and 0 <= j <= m - 1. Then the kth word of a block of data which exists on line i is in module k mod m on that line, for 0 <= k <= b - 1. It is important to note that the first word of a block which exists on line i is in the first module M_{i,0} of that line. If b < m, memory modules M_{i,b}, M_{i,b+1}, ..., M_{i,m-1} will not be utilized, since a block starts in module M_{i,0}. Hence, for effective utilization of memory modules, it is assumed that b >= m.
When a block request is accepted by a line i, the line controller at that line issues b successive internal requests to consecutive modules on line i, starting from module M_{i,0}. It is assumed that these internal requests are issued at the beginning of every time unit. Therefore, the internal request for the kth word of the block will be issued to module M_{i,j}, where j = k mod m, for 0 <= k <= b - 1. It is obvious that this set of b internal requests is not preemptible. Note that if b > m or if the cache is set-associative, the (m + 1)st internal request is for module M_{i,0}. Consequently, the first internal request must be completed by the time the (m + 1)st IR is issued. This constraint is satisfied if c <= m.
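The placement and scheduling rules above can be summarized in a few lines of code; the function names are illustrative.

    # Sketch of the L-M placement rules: block i lives on line i mod l, and word k
    # of a block lives in module k mod m of that line; one IR is issued per cache cycle.

    def block_to_line(block, l):
        return block % l

    def word_to_module(k, m):
        return k % m

    # Example: l = 4 lines, m = 4 modules per line, block size b = 8 words.
    l, m, b = 4, 4, 8
    block = 6
    line = block_to_line(block, l)                   # block 6 -> line 2
    schedule = [(t, word_to_module(t, m)) for t in range(b)]
    print(line, schedule)    # IR for word k is issued at time t = k to module k mod m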
In order to visualize the concurrent servicing of two memory requests on the same line, we define a time unit ⟨t, t + 1⟩ as beginning just after time t and ending just before time t + 1. Therefore, the successive IRs which are generated to modules on a line, in the servicing of a memory request, do not encounter any conflicts. If a memory request is accepted on line i at time t, then the IR for the kth word of a block of size b is initiated at time t + k to module M_{i,j} for j = k mod m and
Figure 7.42 Closed queueing network model for a multiprocessor system (think time, line-service time, and drain time; line service plus drain time constitutes the actual service time).
Figure 7.43 State graph model for memory requests in a tightly coupled multiprocessor.
state has a different average duration. Each processor goes through an independent
state (state A) followed by interactive states (states W and LT) and another inde-
pendent state (DT).
In an independent state, a processor executes on its own node without conflict.
Interactive states are characterized by a potential for conflicts with other pro-
cessors. Hence, during any LT state, the memory line is busy and no other processor
can access the line. Let C be the average memory-request cycle time. The perfor-
mance index is the average processor utilization U, defined as the average fraction
of time spent by each processor in processing instructions. This performance
index reflects the degree of matching between the processors and the memory
organization.
We number the processors from 1 to p and the memory lines from 1 to l. Let

    I_k(t) = [i_{k,1}(t), i_{k,2}(t), ..., i_{k,p}(t)]                     (7.16)

for k = 1, ..., l, with i_{k,j}(t) = 1 if processor j is not waiting for or using line k, and i_{k,j}(t) = 0 if processor j is waiting for or using line k at time t.

I_k(t) is called the indicator vector for line k at time t. Each component i_{k,j}(t) indicates whether or not processor j is waiting for or holding line k. Note that a processor waits for or holds a line whenever it is in state W or LT (interactive states), respectively. Let X_k be the probability that line k is busy and S_k the average line-service time of a request. Then

    X_k = Prob[at least one processor is waiting for or holding line k]
        = 1 - Prob[no processor is waiting for or holding line k]
        = 1 - Prob[i_{k,1} = i_{k,2} = ... = i_{k,p} = 1] = 1 - E[i_{k,1} i_{k,2} ... i_{k,p}]

This last equality results from the fact that the expectation of a random variable taking only the values 0 and 1 is equal to the probability of the variable being 1. The rate of completed requests by line k is X_k/S_k.
In equilibrium, this rate can be equated to the rate of submitted requests to a
line. To compute this second member of the equation, we note that a processor
submits a request whenever it departs from state A. This occurs for each processor
whenever a cycle in the network of Figure 7.43 is completed. Recall that C is the
average time taken by such a cycle. The rate of submitted requests for the memory by any one processor is 1/C. Since there are p requesting processors and each request is submitted with probability q_k to line k, the average rate of submitted requests to line k is p·q_k/C.
Let Y be the average fraction of time a given processor is in an independent state. Hence, Y is also the probability of being in such a state. The symmetry of the system implies the same value of Y for all the processors. Since T is the average time in state A and D that in state DT, Y = (T + D)/C. Substituting for 1/C in the equation for the average rate of submitted requests to a given line and equating this rate to the rate of completed requests, we obtain X_k = S_k·p·Y·q_k/(T + D). Substituting for X_k, we have

    E[i_{k,1} i_{k,2} ... i_{k,p}] + ρ_k·Y = 1                             (7.17)

where ρ_k = S_k·p·q_k/(T + D).
This equation is exact. However, the first term of the left-hand side of the equation is very complex to estimate in general. The approximation consists in neglecting the interactions between processors. As a result of the approximation, the components of I_k(t) are not correlated. This approximation performs best for a short and deterministic line-service time. Indeed, large instances of the line-service time are most likely to result in instantaneously longer queues and more interactions between the processors. Under the noncorrelation conditions,

    E[i_{k,1} i_{k,2} ... i_{k,p}] = E[i_{k,1}] E[i_{k,2}] ... E[i_{k,p}]  (7.18)

If we denote by Z_k the fraction of time spent by each processor waiting for or holding line k, Eq. 7.17 becomes (1 - Z_k)^p + ρ_k·Y = 1 because of the symmetry of the system.
On the other hand, since a processor is either in an independent state or in an interactive state (waiting for or holding one of the lines), then, by the law of total probability, in a system with l lines we have

    Y + Σ(k = 1 to l) Z_k = 1                                              (7.19)

We use Eq. 7.18 with the condition that 1 - ρ_k·Y > 0, so that Z_k = 1 - (1 - ρ_k·Y)^{1/p}. Consequently, by substituting for Z_k in Eq. 7.19 and rearranging, we obtain

    Y = 1 - Σ(k = 1 to l) [1 - (1 - ρ_k·Y)^{1/p}]                          (7.20)
S_k is the mean line-service time and T + D is the mean time between an exit from an interactive state and a visit to the next interactive state. Note that S_k can be found as the mean time that a processor spends holding a memory line k. Similarly, T is found as the mean time spent outside of an interactive state. Y can be solved for by Newton's iterative method, given that a unique solution exists for Y between 0 and min(1, 1/ρ_k). Let us illustrate the application of this model to the system mentioned earlier. For simplicity, assume that for these examples q_k = 1/l and S_k = S for all values of k. Then Eq. 7.20 becomes

    Y = 1 - l·[1 - (1 - (p·S/(l(T + D)))·Y)^{1/p}]                         (7.21)
are made uninterruptedly on the same memory line. In this case, the line that accepts the memory request is busy for 2b - m + c time units. Hence the mean line-service time is

    S = b(1 + w_b) - m + c                                                 (7.22)
Since 1 - h is the cache-miss ratio, the probability that a given cache cycle requires an access to memory is x(1 - h). We account for the crossbar switch setup and traversal times by t_c. Hence the average time spent in state A is

    T = 1/(x(1 - h)) + t_c                                                 (7.23)

Since the drain time is D = m - 1, we can determine Y from Eq. 7.21 for the multiprocessor system with set-associative caches. The processor utilization U, which is the fraction of time the processor is busy processing instructions, is given by

    U = 1/(x(1 - h)·C)

Since C = (T + D)/Y, the utilization can be rewritten as

    U = Y/[x(1 - h)(T + D)]                                                (7.24)
We illustrate the set-associative cache example with a multiprocessor system with p = 16 processors, x = 0.4, and t_c = 0 (an infinitely fast crossbar). The cache hit ratio is h = 0.95, and the block size b is allowed to vary. The memory organization has a fixed number of modules per line (m = 4), but the number of lines varies. The memory-module cycle time, c, also varies. Figure 7.44 shows the application of Eqs. 7.21 through 7.24 for the given set of parameters.
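A numerical sketch of Eqs. 7.21 through 7.24 for one design point is shown below. The parameters follow the example in the text, except that w_b (the probability that a replaced block is dirty) is not given there and is assumed here; bisection is used in place of Newton's method purely for brevity.

    # Numerical sketch of the utilization model (Eqs. 7.21-7.24); names are illustrative.

    def solve_Y(p, l, S, T, D):
        """Solve Y = 1 - l*(1 - (1 - rho*Y)**(1/p)) by bisection (the text uses Newton)."""
        rho = p * S / (l * (T + D))
        lo, hi = 0.0, min(1.0, 1.0 / rho)
        f = lambda Y: 1.0 - l * (1.0 - (1.0 - rho * Y) ** (1.0 / p)) - Y
        for _ in range(200):
            mid = 0.5 * (lo + hi)
            if f(mid) > 0:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    p, x, h, m, t_c = 16, 0.4, 0.95, 4, 0.0      # parameters from the text's example
    b, c, l = 8, 2, 8                            # one (b, c, l) design point
    w_b = 0.3                                    # assumed write-back probability

    S = b * (1 + w_b) - m + c                    # Eq. 7.22: mean line-service time
    T = 1.0 / (x * (1 - h)) + t_c                # Eq. 7.23: mean time in state A
    D = m - 1                                    # drain time
    Y = solve_Y(p, l, S, T, D)
    U = Y / (x * (1 - h) * (T + D))              # Eq. 7.24: processor utilization
    print(round(Y, 3), round(U, 3))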
This result assumes that the cache size is adjusted to give the same hit ratio when the block size is varied. An increase in the block size degrades the utilization. An operating region should be chosen where the utilization is acceptable. Note that for certain values of b and c, small values of l will give high utilization. This possible reduction in l gives the designer a choice. If, for a small number of lines, l < 16, the utilization is acceptable, the designer can consider trade-offs with a low-cost multiport memory.

Figure 7.44 Effect of block size and main memory speed on processor utilization for the set-associative cache. (Courtesy of IEEE Trans. Computers, Briggs and Dubois, Jan. 1983.)
7.3.3 Multicache Problems and Solutions
The presence of private caches in a multiprocessor necessarily introduces problems of cache coherence, which may result in data inconsistency. That is, several copies of the same data may exist in different caches at any given time. This is a potential problem especially in asynchronous parallel algorithms which do not possess explicit synchronization stages of the computation. For example, process A, which runs on processor i, produces data x, which is to be consumed by process B, which runs on processor j ≠ i asynchronously. Process A writes a new x into its cache while process B uses the old value of x in its cache because it is not aware of
the new x. Process B may continue to use the old value of x in its cache unless it is informed of the presence of the new x in process A's cache, so that a copy of it may be made in its own cache. The possibility of having several processors using different copies of the same data must be avoided if the system is to perform correctly. Hence, data consistency must be enforced in the caches.
Another form of the data consistency problem occurs in a multiprogrammed
multiprocessor system. In this case, a processor usually switches to other processes
at the time of the arrival of external interrupt signals or page fault operations.
If the suspended process migrates to another processor, the most recently updated
data of this process might still be in the original processor’s cache. Hence a process
running on a new processor could use stale data in main memory. The new pro-
cessor cannot recognize the data as stale, and thus would not be working with the
process’s proper context. Such an operation is incorrect and can result in subtle
errors that are difficult to trace. In many multiprocessor systems such as the S-1,
privileged instructions are provided to sweep the cache. The cache sweep operation
is used to deliberately update main memory to reflect any changes in cache
contents.
A system of caches is coherent if and only if a READ performed by any processor i of a main memory location x (which may be cached by other processors) always delivers the most recent value with the same address x. "Most recent" in this context has a special meaning in terms of a partial ordering of the READs and WRITEs of memory throughout the multiprocessor. However, for an intuitive understanding of the problem, it is sufficient to think of recency in terms of absolute time. In these terms, whenever a WRITE is done by one processor i to a memory location x, completion of the WRITE must guarantee that all subsequent READs of location x by any processor will deliver the new contents of x until another WRITE to x is completed.
The cache coherence problem exists only when the caches are associated with
the processors. Designs have been proposed in which the caches are associated
with the shared memory as shown in Figure 7.45. This avoids the cache coherence
problem. This architecture is good for systems with a small number of processors.
However, the potential gain in speed is then limited by the transmission delays
through the interconnection network and by the conflicts at the caches. This
technique has been shown to be adequate for multiprocessors where each processor
is pipelined and executes multiple independent instruction streams.
Clearly, the cache coherence problem cannot be solved by a mere write-
through policy. If a write-through policy is used, the main memory location is
updated, but the possible copies of the variable in other caches are not auto-
matically updated by the write-through mechanism. When a processor modifies
data in its cache, all the potential copies in other caches must be invalidated.
“Write-through” is neither necessary nor sufficient for coherence.
Static coherence check Two different methods have been proposed to solve the
cache coherence problem. The first method, called static coherence check, avoids
Figure 7.45 Caches associated with shared memory lines to avoid data inconsistency.
If the cycle ratio t_m/t_c is large (a typical value is 5), the performance of this scheme may be quite poor for algorithms with intense sharing, regardless of the cache size. Moreover, the requests for shared data increase contention in the interconnection network and at the memory. The performance of this scheme can be improved by associating a high-speed memory or cache module with each memory line. This cache module is used to buffer the noncacheable data, thereby reducing the effective t_m and hence the cycle ratio t_m/t_c.
In a similar scheme which avoids these problems, the shared data is accessed
through a shared cache while instruction fetches and private data references are
made in private caches. Figure 7.46 illustrates this shared cache concept. Notice
that the shared cache may consist of interleaved cache modules which may be
connected to the processors and shared memory through an interconnection
network. However, the complexity of this network is expected to be less than a full crossbar. All data references proceed at the cache speed except when conflicts occur at the shared cache or a miss occurs in either a private cache or the shared cache.

Figure 7.46 Multiprocessor system with private and shared data paths.
If the hit ratio is high enough in all caches, this scheme alleviates the contention
problem at the main memory. The success of the shared cache concept relies on
the relatively low rate of shared data references. Of course, the shared data may
exhibit less locality than private data. However, the hit ratio in the shared cache
improves as the degree of sharing of the shared variables increases. Indeed, a
processor may find a shared variable in the shared cache even if it never referenced
it before. Moreover, increasing the size of the shared cache is an effective method
to improve the hit ratio.
The shared cache concept requires that data be tagged as private or shared.
The tagging is basically static. Static tags are made during compile time and remain the same throughout the lifetime of the process. Dynamic tags are made during the execution of cooperating processes. A lookahead mechanism monitors the history
of sharing of the data space in one phase and predicts the probability of sharing of
the subspaces in the next phase. With this scheme, a data subspace could be in the
shared cache for effective sharing in one phase and in the private caches in another
phase, for efficient access, or vice-versa. The overhead may be unacceptable, as
the caches must be flushed to main memory. Moreover, the migration of data sets
can create constraints on the scheduler or loader. The tagging of data necessitates that the compiler be designed to detect private and shared data. With the advent of
such abstract and block-structured languages as concurrent Pascal, this can be
accomplished by explicit indication of such data sets. It can be argued that the
shared cache concept lacks flexibility.
Dynamic coherence check The second method for solving cache coherence is
more flexible than the static coherence check, but also more complex and possibly
more costly. In this scheme, called dynamic coherence check, multiple copies are
allowed. However, whenever a processor modifies a location x in a cache block,
it must check the other caches to invalidate possible copies. This operation is
referred to as a cross-interrogate (XI). In the most rudimentary implementation of this method, the caches are tied to a high-speed bus. When a local processor writes into a shared block in its cache, the processor sends a signal to all the remote caches to indicate that "the data at memory address x has been modified." At the same time, it writes through to memory. Note that, to ensure correctness of execution, a processor which requests an XI must wait for an acknowledge signal
from all other processors before it can complete the write operation. The XI
invalidates the remote cache location corresponding to x if it exists in that cache.
WWW.Gitmgurgaon.blogspot.com
522 COMPUTER ARCHITECTURE AND PARALLEL PROCESSING
When the remote processor references this invalid cache location, it results in a
cache miss, which is serviced to retrieve the block containing the updated in-
formation.
For each write operation, (n — 1) XIs result, where n is the number of
processors. When n increases, the traffic on the high-speed bus becomes a bottle-
neck. Moreover, there is a potential for races if the XI requests are queued to
accommodate the peak traffic on the bus. Some commercial multiprocessors with
caches use this technique for a small number of processors. For example, the
Honeywell 60/66 and Univac 1100/80 multiprocessors have cache-invalidate
interfaces between every pair of caches. Note that the two sources of inefficiency
for this technique are the necessity of a write-through policy, which increases the
network traffic, and the redundant cache XIs which are performed. In the latter
case, a cache is purged blindly whether or not it contains the data item x.
A more refined technique filters the XI requests before they are initiated. In a proposed design, the memory control element (MSC) maintains a central copy of the directories of all the caches. We will elaborate on a similar scheme, called the presence-flag technique, which assumes a write-back update policy. There are two central tables associated with the blocks of main memory (MM) (Figure 7.47). The first table is a two-dimensional table called the Present table. In this table, each entry P[i, c] contains a present flag for the ith block in MM and the cth cache. If P[i, c] = 1, then the cth cache has a copy of the ith block of MM; otherwise it is zero. The second table is the Modified table and is one-dimensional. In this table, each entry M[i] contains a modified flag for the ith block of MM. If M[i] = 1, it means that there exists a cache with a copy of the ith block more recent than the corresponding copy in MM. The Present and Modified tables can be implemented in a fast random-access memory.
The philosophy behind the cache coherence check is that an arbitrary number of caches can have a copy of a block, provided that all the copies are identical. They are identical if the processor associated with each of the caches has not attempted to modify its copy since the copy was loaded into its cache. We refer to such a copy as a read-only (RO) copy. In order to modify a block copy in its cache, a processor must own the block copy with exclusive read-only (EX) or exclusive read-write (RW) access rights. A copy is held EX in a cache if the cache is the only one with the block copy and the copy has not been modified. Similarly, a copy is held RW in a cache if the cache is the only one with the block copy and the copy has been modified. Therefore, for consistency, only one processor can at any time own an EX or RW copy of a block.

To enforce the cache consistency rule, local flags are provided within each cache in addition to the global tables. A local flag L[k, c] is provided for each block k in cache c. This flag indicates the state of each block in the cache. A block in a cache can be in one of three states: RO, EX, or RW. When a processor c fetches a block i on a read miss, the processor obtains an EX copy of the block, provided no other cache has a copy of block i and the fetch was for data. On other fetch misses, the block is assigned RO, as shown in Figure 7.48. The status information is recorded in the cache directory and global tables. The status is indicated in the global table by setting the appropriate present flag and clearing the corresponding modified bit.
Figure 7.47 Organization of flags for dynamic solution to cache coherence.
As long as the copy of block i remains present in the cache, processor c can fetch it without any consistency check. If processor c attempts to store into its copy of block i, it must ensure that all other copies (if any) of block i are invalidated. To do this, the global table is consulted. It should indicate the processor caches that own a copy of block i. The modified bit for block i is updated in the global table to record the fact that processor c owns block i with RW access rights. Finally, the local L[i, c] flag is set to RW to indicate that the block is modified. The flowchart for a store is given in Figure 7.49.

In this implementation, a block copy in a cache is invalidated whenever the cache receives a signal from some other processor attempting to store into it. Moreover, a cache which owns an RW copy may receive a signal from a remote cache requesting to own an RO copy. In this case, the RW copy's state is changed to RO.
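The sketch below captures the bookkeeping of the presence-flag technique in a heavily simplified form: only the global Present and Modified tables and the local RO/EX/RW states are modeled, data movement and the D-fetch/I-fetch distinction are collapsed, and all names are illustrative.

    # Heavily simplified sketch of the presence-flag coherence bookkeeping.
    # Global tables: present[i][c], modified[i]; local state per (cache, block)
    # is 'RO', 'EX', or 'RW'.  Data transfer itself is not modeled.

    class PresenceFlagDirectory:
        def __init__(self, n_blocks, n_caches):
            self.present = [[False] * n_caches for _ in range(n_blocks)]
            self.modified = [False] * n_blocks
            self.state = {}                      # (cache, block) -> 'RO' | 'EX' | 'RW'

        def read_miss(self, c, i, data_fetch=True):
            others = [k for k, p in enumerate(self.present[i]) if p and k != c]
            for k in others:                     # an EX/RW holder is no longer the only copy
                if self.state.get((k, i)) in ('EX', 'RW'):
                    if self.state[(k, i)] == 'RW':
                        self.modified[i] = False # the dirty copy would be written back to MM
                    self.state[(k, i)] = 'RO'
            self.state[(c, i)] = 'EX' if (not others and data_fetch) else 'RO'
            self.present[i][c] = True

        def write(self, c, i):
            if self.state.get((c, i)) == 'RW':   # already owned read-write: no check needed
                return
            for k, p in enumerate(self.present[i]):
                if p and k != c:                 # invalidate all other copies of block i
                    self.present[i][k] = False
                    self.state.pop((k, i), None)
            self.present[i][c] = True
            self.modified[i] = True              # cache c now holds the only, newer copy
            self.state[(c, i)] = 'RW'

    d = PresenceFlagDirectory(n_blocks=4, n_caches=2)
    d.read_miss(0, 2)      # cache 0 loads block 2 with EX rights (sole copy, data fetch)
    d.read_miss(1, 2)      # cache 1 reads it too; cache 0's copy is downgraded to RO
    d.write(0, 2)          # cache 0 stores: cache 1's copy is invalidated, block becomes RW
    print(d.state, d.present[2], d.modified[2])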
Figure 7.48 Fetch of block i into local cache c (data fetch versus instruction fetch).

Figure 7.49 Store into block i in local cache c.
system is executed by one peripheral processor P_0. All the other processors (central or peripheral) are treated as slaves to P_0. Another example is found in the DEC
System 10, in which there are two identical processors. One of the processors is
designated as master and the other as slave. The operating system runs only on the
master, with the slave treated as a schedulable resource.
Since the supervisor routine is always executed in the same processor, a slave
request via a trap or supervisor call instruction for an executive service must be
sent to the master, which acknowledges the request and performs the appropriate
service. The supervisor and its associated procedures need not be reentrant
since there is only one processor that uses them. There are other characteristics
of the master-slave operating system. Table conflicts and lock-out problems for
system control tables are simplified by forcing a single processor to run the execu-
tive. However, this operating system mode causes the entire system to be very
susceptible to catastrophic failures which require operator intervention to restart the master processor when an irrecoverable error occurs. In addition to the inflexibility of the overall system, the utilization of the slave processors may become
appreciably low if the master cannot dispatch processes fast enough to keep the
slaves busy. The master-slave mode is most effective for special applications where
the work load is well defined or for asymmetrical systems in which the slaves have
less capability than the master processor. It is the mode sometimes used if there are
very few processors involved.
When there is a separate supervisor system (kernel) running in each processor,
the operating system characteristics are very different from the master-slave
systems. This is similar to the approach taken by computer networks, where each
processor contains a copy of a basic kernel. Resource sharing occurs at a higher
level, for example, via a shared file structure. Each processor services its own
needs. However, since there is some interaction between the processors, it is
necessary for some of the supervisory code to be reentrant or replicated to provide
separate copies for each processor. Although each supervisor has its own set of
private tables, some tables are common and shared by the whole system. This
creates table access problems. The method used in accessing the shared resources
depends on the degree of coupling among the processors. The separate supervisor
operating system is not as sensitive to a catastrophic failure as a master-slave
system. Also, each processor has its own set of input-output devices and files, and
any reconfiguration of I/O usually requires manual intervention and possibly
manual switching.
Unfortunately, the replication of the kernel in the processors would demand
a lot of memory which may be underutilized, especially when compared with the
utilization of the shared data structures. A static form of caching could be used to
buffer frequently used portions of the operating system code, while the infrequently
used code could be centralized in a shared memory. Unfortunately, the determina-
tion of which portions of the operating system are frequently executed is relatively difficult to make and is likely to be dependent on the application workload.
The floating supervisor control scheme treats all the processors as well as other
resources symmetrically or as an anonymous pool of resources. This is the most
difficult mode of operation and the most flexible. In this mode, the supervisor
WWW.Gitmgurgaon.blogspot.com
§28 COMPUTER ARCHITECTURE ANTY PARALLEL PROCESSING
routine floats from one processor to another, although several of the processors may
be executing supervisory service routines simultaneously. This type of system can
attain better load balancing over all types of resources. Conflicts in service requests are resolved by priorities that are either set statically or under dynamic control. Since there is a considerable degree of sharing, most of the supervisory code must be reentrant. In this system, table-access conflicts and table lock-out delays cannot be avoided. It is important to control these accesses in such a way that system integrity
is protected. This mode of operation has the advantages of providing graceful
degradation and better availability of a reduced capacity system. Furthermore, it
provides true redundancy and makes the most efficient use of available resources.
Examples of operating systems that execute in this mode are the MVS and VM in
the IBM 3081 and the Hydra on the C.mmp.
Most operating systems, however, are not pure examples of any of the three classes discussed above. The only generalization that is possible is that the first
system produced is usually of the master-slave type and the ultimate being sought
is the floating supervisor control. In Table 7.3, we summarize the major charac-
teristics, advantages, and shortcomings of the above three types of operating
systems for multiprocessor computers.
Table 7.3 Characteristics, advantages, and shortcomings of the three types of multiprocessor operating systems (master-slave, separate supervisor, and floating supervisor).
processes, delayed (at least logically) only when it needs to wait to interact with
one or more other processes. Hence, a parallel program can be said to consist of
two or more interacting processes.
The potential of multiprocessing is achieved by enhancing its capability for
parallel processing. Parallel processing can be indicated in a program explicitly or
implicitly. For explicit parallelism, users must be provided with programming
abstractions that permit them to indicate explicit parallelism when desired in a
program. Implicit parallelism is detected by the compiler. In this case, the compiler
scans the source program and recognizes the program flow. From this flow graph
and other conditions, it detects nontrivial units of program statements which may
be identified as a process. Some of these units may be independent and can be run
concurrently with other processes.
In a multiprocessor system, synchronization takes on increased importance
as it could create too high a penalty. This could significantly degrade system per-
formance if the synchronization mechanisms are not efficient and the algorithms
that use them are not properly designed. In some processors, the synchronization
primitives are not implemented directly in hardware or microcode. Therefore,
software alternatives must be provided. For example, the PDP-11 processors used for the C.mmp have been implemented with the semaphore-synchronization
primitive in software, thereby taking a significant number of instructions. In an
environment where processes need to synchronize often, this may be a major
bottleneck.
Program-control structures are provided to aid the programmer in developing efficient parallel algorithms. Three basic nonsequential program-control structures have been identified. These control structures are characterized by the fact that the programmer need only focus on a small program and not on the overall control of
the computation. The first example is the message-based organization which was
used in the Cm* operating system. In this organization, computation is performed
by multiple homogeneous processes that execute independently and interact via
messages. The grain size of a typical process depends on the system.
The second example of a control structure is the chore structure. In this
structure, all codes are broken into small units. The process that executes the unit
of code (and the code itself) is called a chore. An important characteristic of a chore
is that once it begins execution, it runs to completion. Hence, to avoid long waits,
chores are basically small. They have relatively very little input and they reference
only a few different objects. Moreover, they do not block and are noninterruptible.
As part of its output, one chore might request the execution of a small set of
additional chores. Examples of systems that use this structure are the Pluribus
and the BCC-500.
Consider the memory-management portion of the operating system, which
controls swapping between the main memory and a fixed-head disk. Sample
chores may include (a) the disk command to request the transfer of a page of data
between the disk and the memory, and (b) acknowledging completion of a disk-
sector transmission and arranging for any subsequent action.
The third nonsequential control structure is that of production systems, now often used in artificial intelligence systems. Productions are expressions of the
MULTIPROCESSOR ARCHITECTURE AND PROGRAMMING 531
1. The protection of the resources of one process from willful or accidental damage
by other processes
2. The provision for communication among processes and between user processes
and supervisor processes
The goal of protection is to ensure that data and procedures are accessed
correctly. When two or more processes wish to access a set of resources within
the multiprocessor system, it is necessary to allocate the resources in such a way
that the total resources of the system are not exceeded. Furthermore, if it is possible
for a process to acquire a portion of the resources that it requires and then subse-
quently make a request for more, it is necessary to ensure that future demands can
always be satisfied. As an example, suppose that the system has one card reader
and one printer. Process one requests the card reader and process two requests the
printer. If processes one and two subsequently request the printer and card reader,
respectively, then the system is in a state where two processes are blocked in-
definitely. This situation is called deadlock or deadly embrace. Methods for detecting
and preventing deadlock, protection schemes and communication mechanisms
for multiprocessor systems will be discussed in Chapter 8.
execution of the FORK A, J statement causes the same action as FORK A and also increments a counter at address J. FORK A, J, N causes the same action as FORK A and sets the counter at address J to N. In all usages of the FORK statements, the corresponding JOIN statement is expressed as JOIN J. The execution of this statement decrements the counter at J by one. If the result is 0, the process at address J + 1 is initiated; otherwise the processor executing the JOIN statement is released. Hence, all processes executing the JOIN terminate, except the very last one.
Application of these instructions for the control of three concurrent processes is shown in Figure 7.50. These instructions do not allow a path to terminate without encountering a junction point.

Figure 7.50 Conway's fork-join concept. (Courtesy of AFIPS Press, FJCC Proc., 1963.)

The problem with FORK and JOIN is
that, unless it is judiciously used, it blurs the distinction between statements that are executed sequentially and those that may be executed concurrently. FORK and JOIN statements are to parallel programming what the GO TO statement is to sequential programming. Also, because FORK and JOIN can appear in conditional statements and loops, a detailed understanding of program execution is necessary to enable the parallel activities. Nevertheless, when used in a disciplined manner, the statements are a practical way to enable parallelism explicitly. For example, FORK provides a direct mechanism for dynamic process creation, including multiple activation of the same program text.
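The counter semantics of FORK A, J, N and JOIN J can be mimicked with threads, as in the sketch below; the class and function names are illustrative, and Python threads stand in for the spawned processes.

    # Sketch of the counter semantics of FORK A, J, N / JOIN J using Python threads.
    # join() decrements the counter at J; only the last arrival continues at J + 1.

    import threading

    class JoinCounter:
        def __init__(self, n):
            self.count = n                  # FORK A, J, N: counter at J set to N
            self.lock = threading.Lock()

        def join(self, continuation):
            with self.lock:
                self.count -= 1
                last = (self.count == 0)
            if last:
                continuation()              # the last process to reach JOIN runs J + 1
            # otherwise the processor executing the JOIN is simply released

    def worker(i, j):
        print(f"process {i} done")
        j.join(lambda: print("all processes joined; continuing at J + 1"))

    j = JoinCounter(3)
    threads = [threading.Thread(target=worker, args=(i, j)) for i in range(3)]
    for t in threads: t.start()
    for t in threads: t.join()              # Thread.join here just waits for the threads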
An equivalent extension of the FORK-JOIN concept is the block-structured language construct originally proposed by Dijkstra. In this case, each process in a set of n processes S_1, S_2, ..., S_n can be executed concurrently by using the following cobegin-coend (or parbegin-parend) construct:

    begin
      S_0;
      cobegin S_1; S_2; ...; S_n coend                                     (7.25)
      S_{n+1};
    end
The cobegin declares explicitly the parts of a program that may execute concurrently. This makes it possible to distinguish between shared and local variables, which in turn makes clear from the program text the potential sources of interference. Figure 7.51 illustrates the precedence graph of the concurrent program given above. In this case, the block of statements between the cobegin-coend is executed concurrently only after the execution of statement S_0. Statement S_{n+1} is executed only after all executions of the statements S_1, S_2, ..., S_n have terminated. Since a concurrent statement has a single entry and a single exit, it is well suited
    begin
      S_0;
      cobegin
        S_1;
        begin S_2; cobegin S_3; S_4; S_5 coend; S_6; end                   (7.27)
        S_7;
      coend
      S_8;
    end
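Read as a precedence constraint, the nested statement (7.27) could be expressed with threads roughly as follows; cobegin here is a hypothetical helper that simply waits for all of its arguments to finish.

    # Sketch: the nested concurrent statement (7.27) expressed with threads.

    from concurrent.futures import ThreadPoolExecutor

    def S(name):                      # stand-in for the statements S0 ... S8
        print(name)

    def cobegin(*tasks):              # run the given tasks concurrently, wait for all
        with ThreadPoolExecutor() as pool:
            for f in [pool.submit(t) for t in tasks]:
                f.result()

    def inner_block():                # begin S2; cobegin S3; S4; S5 coend; S6; end
        S("S2")
        cobegin(lambda: S("S3"), lambda: S("S4"), lambda: S("S5"))
        S("S6")

    S("S0")
    cobegin(lambda: S("S1"), inner_block, lambda: S("S7"))
    S("S8")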
Figure 7.53 shows the realization of a parallel for (parfor) statement using these instructions. Notice that statement S is executed for each value of i, and that this scheme is independent of the number of processors available in the system. Moreover, there is no need to state explicitly the relationship between the AND and JOIN primitives.

Figure 7.53 Realization of parallel for statement using defined primitives. (Courtesy of ACM Computing Surveys, Baer 1973.)

Consider the matrix computation C = A · B, where A is
Figure 7.53 Realization of parallel for statement using defined primitives. (Courtesy of ACM Computing Surveys, Baer 1973.)
an n × n matrix and B and C are n × 1 column vectors, for a very large n. The
algorithm to compute the matrix C is given below using the parfor statement to
spawn p independent processes. Assume that p divides n and n/p = s:
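The program listing itself is not reproduced here; the following C sketch with pthreads shows the same decomposition under assumed sizes N and P, where each of the p spawned processes handles s = n/p consecutive values of i.

    /* A sketch of the parfor decomposition for C = A·B (A is n-by-n, B and C
       are n-by-1); N, P, and the static arrays are assumptions of the example. */
    #include <pthread.h>

    #define N 8
    #define P 4                      /* number of spawned processes; P divides N */

    static double A[N][N], B[N], C[N];

    static void *rows(void *arg)     /* one process: s = N/P consecutive values of i */
    {
        long k = (long)arg, s = N / P;
        for (long i = k * s; i < (k + 1) * s; i++) {
            double sum = 0.0;
            for (long j = 0; j < N; j++)
                sum += A[i][j] * B[j];
            C[i] = sum;              /* each process writes a disjoint group of C(i) */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[P];
        for (long k = 0; k < P; k++)           /* "parfor k := 0 until P - 1" */
            pthread_create(&t[k], NULL, rows, (void *)k);
        for (long k = 0; k < P; k++)
            pthread_join(t[k], NULL);
        return 0;
    }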
Each process being spawned computes the statements between the outermost
begin-end constructs for a different value of i. Hence, the computation of each
group of C(i) is done concurrently. Concurrent processes that access shared vari-
ables are called communicating processes. When processes compete for the use of
shared resources, common variables are necessary to keep track of the requests
for service.
A very common problem occurs when two or more concurrent processes
share data which is modifiable. If a process is allowed to access a set of variables
that is being updated by another process concurrently, erroneous results will
occur in the computation. Therefore, controlled access to the shared variables
should be required of the computations so as to guarantee that a process will
have mutually exclusive access to the sections of programs and data which are
nonreentrant or modifiable. Such segments of programs are called critical sections.
The following assumptions are usually made regarding critical sections:
var v: shared V;
var w: shared W;
cobegin                                               (7.29)
   csect v do P;
   csect w do Q;
coend
csect v do
   begin
      ...;                                            (7.30)
      csect w do S;
   end
cobegin
   P₁: csect v do csect w do S₁;
                                                      (7.31)
   P₂: csect w do csect v do S₂;
coend
When process P₁ tries to enter its critical section w, it will be delayed because P₂
is already inside its critical section w. And process P₂ will be delayed trying to
enter its section v because P₁ is already inside its section v.
The deadlock occurs because two processes enter their critical sections in
opposite order and create a situation in which each process is waiting indefinitely
for the completion of a region within the other process. This circular wait is a
condition for deadlock. The deadlock is possible because it is assumed that a
resource cannot be released (preempted) by a process waiting for an allocation
of another resource. From this technique, an algorithm can be designed to find
a subset of resources that would incur the minimum cost if preempted. This
approach means that, after each preemption, the detection algorithm must be
reinvoked to check whether deadlock still exists.
A process which has a resource preempted from it must make a subsequent
request for the resource to be reallocated to it. As an example, we consider a
system in which one process produces and sends a sequence of data items to
another process that receives and consumes them. It is an obvious constraint that
these data items cannot be received faster than they are sent. To satisfy this require-
ment, it is sometimes necessary to delay further execution of the receiving process
until the sending process produces another data item. Synchronization is a general
term for timing constraints of this type of communication imposed on interactions
between concurrent processes.
The simplest form of interaction is an exchange of timing signals between
two processes. A well-known example is the use of interrupts to signal the comple-
tion of asynchronous peripheral operations to the processor. Another type of
timing signal, the event, was used in early multiprocessing systems to synchronize
MULTIPROCESSOR ARCHITECTURE AND PROGRAMMING 541
concurrent processes. When a process decides to wait for an event, the execution
of its next operation is delayed until another process signals the occurrence of the
event.
The following program illustrates the transmission of timing signals from one
process to another by means of a shared variable e of type event. Both processes
are assumed to be cyclical. Notice that the concurrent operations wait and signal
both access the same variable e.
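The program itself is not reproduced here. As a rough illustration of the wait/signal semantics just described, the following C sketch implements an event variable with a mutex and condition variable; keeping a flag so that a signal arriving before the corresponding wait is not lost is a design choice of this sketch, not necessarily the book's.

    /* A minimal event variable with wait(e) and signal(e) (a sketch). */
    #include <pthread.h>
    #include <stdbool.h>

    typedef struct {
        pthread_mutex_t m;
        pthread_cond_t  c;
        bool            occurred;
    } event;

    void event_init(event *e)
    {
        pthread_mutex_init(&e->m, NULL);
        pthread_cond_init(&e->c, NULL);
        e->occurred = false;
    }

    void wait_event(event *e)                  /* wait(e) */
    {
        pthread_mutex_lock(&e->m);
        while (!e->occurred)
            pthread_cond_wait(&e->c, &e->m);   /* delay until the event is signaled */
        e->occurred = false;                   /* consume the timing signal */
        pthread_mutex_unlock(&e->m);
    }

    void signal_event(event *e)                /* signal(e) */
    {
        pthread_mutex_lock(&e->m);
        e->occurred = true;
        pthread_cond_signal(&e->c);            /* wake a waiting process, if any */
        pthread_mutex_unlock(&e->m);
    }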
Data dependency is the main factor in the detection of parallelism among processes.
Consider several statements Tᵢ of a sequentially organized program, illustrated in
Figure 7.55a. If the execution of statement T₃ is independent of the order in which
statements T₁ and T₂ are executed (Figure 7.55a, b), then parallelism is said to exist
between statements T₁ and T₂. They can, therefore, be executed in parallel, as
shown in Figure 7.55c. This commutativity is a necessary but not a sufficient condi-
tion for parallel execution. There may exist, for instance, two statements which
can be executed in either order but not in parallel. For example, an FFT computa-
tion produces its output in a scrambled (bit-reversed) order, as shown in Chapter 6.
Therefore, there are two ways to perform FFT computations as shown below:
1. Method one
a. Bit reverse the input.
b. Perform the FFT.
2. Method two
a. Perform the FFT.
b. Bit reverse the output.
Thus the FFT and bit-reversal operations are two distinct processes
which can be executed in either order with the same result. They cannot,
however, be executed in parallel.
Figure 7.55 Sequential and parallel execution of a computational process.
1. The read set Iᵢ represents the set of all memory locations for which the first
   operation in Tᵢ involving them is a fetch.
2. The write set Oᵢ represents the set of all locations that are stored into in Tᵢ.
The conditions under which two sequential processes T₁ and T₂ can be executed
as two independent and concurrent processes are given below:
   I₁ ∩ O₂ = ∅                                        (7.33)
   I₂ ∩ O₁ = ∅                                        (7.34)
Furthermore, to maintain the state of the machine (the contents of the total
memory locations) when entering T₃ independently of the mode (parallel or
sequential) of execution of T₁ and T₂, I₃ must be independent of the storing
operations in T₁ and T₂, that is,
It is based on the above conditions that systems have been written for auto-
matic detection of parallelism in source programs written in high-level languages.
But the granularity of each of the processes created is usually small. At this point,
it is desirable to clarify some possible misinterpretations of the implications
of this method. The method does not try to determine whether any or all of the
iterations within a loop can be executed simultaneously. Rather, the iterations
executed sequentially are considered as a single task. Given a Pascal FOR state-
ment, it is possible to detect if all executions of the loop must be performed se-
quentially or all of them can be executed simultaneously.
The total replication test can be approached at different levels of sophistication.
Let L = {S₁, ..., Sᵢ, ..., Sₖ} be the statements composing the FOR loop. Then one
can form the following input and output sets: I_L = I₁ ∪ ··· ∪ Iₖ and O_L = O₁ ∪ ··· ∪ Oₖ, where
I_L and O_L are the input and output sets formed with the variables referenced within L,
with each subscripted array element being an individual entry. If I_L ∩ O_L = ∅, then
all loop iterations can be replicated, for example:
for i ← 1 until n do
   begin
      A(i) ← B(i);
      C(i) ← A(i) + D(i);                             (7.37)
   end
But if I_L ∩ O_L ≠ ∅, then one can look at the variables for which conflicts arise.
If those are set before they are used, then the conflict is artificial and replication
is permissible as, for example, in
for i ← 1 until n do
   begin
      A(i) ← f(A(i));
      T ← g(A(i));                                    (7.38)
      B(i) ← h(T);
   end
where a different T could be set aside for each replication. In practice, a compiler
which incorporates an intelligent recognizer of parallelism with sufficient granu-
larity is very difficult to implement. It is still a research problem. The recognizer
often represents an overhead which may not be cost-effective for analyzing certain
classes of programs to determine their parallel processability. The benefits of
parallel processing obtained by using the recognizer will accrue only if the program
is run many times, so that the initial overhead may be distributed over the
many runs of the program.
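For illustration only (this is not the detection system discussed above), the following C sketch tests conditions (7.33) and (7.34), together with the usual requirement that the two write sets be disjoint, for two statements whose read and write sets are encoded as bit masks over a small set of variables.

    /* A sketch of an independence test on read/write sets given as bit masks. */
    #include <stdbool.h>
    #include <stdint.h>

    /* each bit position stands for one program variable */
    bool can_run_in_parallel(uint32_t I1, uint32_t O1, uint32_t I2, uint32_t O2)
    {
        return (I1 & O2) == 0 &&   /* I1 ∩ O2 = ∅, condition (7.33)/(7.34) */
               (I2 & O1) == 0 &&   /* I2 ∩ O1 = ∅ */
               (O1 & O2) == 0;     /* the write sets must not overlap either */
    }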
access times, the memory bandwidth and its capacity. Depending on the archi-
tectural features, different aspects of the system may become the main bottlenecks
in achieving a high level of concurrency.
The effective utilization of many multiprocessing computers is limited by the
lack of a practical methodology for designing application programs to run on
such computer systems. A major technique is to decompose algorithms for parallel
execution. Two major issues in decomposition can be identified as partitioning
and assignment. Partitioning is the division of an algorithm into procedures,
modules and processes. Assignment refers to the allocation of these units to
processors. These problems are among the most difficult and important in parallel
processing. The emphasis below is to include communication factors as criteria
for partitioning. Assignment techniques will be discussed in Section 8.4.
We will consider an example of an algorithm written for a uniprocessor and
investigate the decomposition and restructuring of this algorithm for a multi-
processor system. The example is an image-processing algorithm called histo-
gramming. A typical picture is represented by a rectangular array of picture
elements (pixels). Each pixel has a small integer value (8 bits) between 0 and b − 1
(inclusive) that represents a gray-scale value of a black-and-white picture or, for
color images, the intensity of a primary color. Typically, 2 ≤ b ≤ 256. Histo-
gramming involves keeping track of the frequency of occurrence of each gray-
scale value; thus, for 8-bit pixels, b such frequencies (simple counters) must
be maintained. Let histog[0:b − 1] represent the array that keeps the count of
the number of occurrences of the gray-scale values 0, 1, ..., b − 1. The rectangular
picture is represented by the two-dimensional array of picture elements pixel
[0:m − 1, 0:n − 1], with m rows and n columns. If pixel[i, j] represents the gray-
scale value of the pixel at coordinate i, j, then the following serially coded program
will update the histogram to include the pixel at i, j:
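The serial listing is not reproduced above; a minimal C sketch of the update applied over the whole picture, assuming b = 256 gray levels and an m × n pixel array with the array names used in the text, might look as follows.

    /* Serial histogramming (a sketch with assumed sizes). */
    enum { B = 256, M = 512, N = 512 };      /* b, m, n: assumed values */

    static unsigned char pixel[M][N];
    static long histog[B];

    void histogram_serial(void)
    {
        for (int i = 0; i < M; i++)
            for (int j = 0; j < N; j++)
                histog[pixel[i][j]] += 1;    /* count one more occurrence of this gray level */
    }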
is thus p. In general, the value of p affects the performance of the algorithm. Using
the parfor statement, the parallel algorithm to histogram the image may be written
as follows:
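The parallel listing is likewise not reproduced; the C sketch below captures the decomposition described in the text, with p processes each scanning s = m/p rows and every update of the shared histog array performed as a critical section. A single global lock is used here for brevity; the text itself only requires that each counter update be mutually exclusive.

    /* Parallel histogramming with a shared histogram (a sketch). */
    #include <pthread.h>

    enum { B = 256, M = 512, N = 512, P = 4 };   /* b, m, n, p: assumed values */

    static unsigned char pixel[M][N];
    static long histog[B];
    static pthread_mutex_t bin_lock = PTHREAD_MUTEX_INITIALIZER;

    static void *histo_part(void *arg)
    {
        long k = (long)arg, s = M / P;           /* this process scans rows k*s .. (k+1)*s - 1 */
        for (long i = k * s; i < (k + 1) * s; i++)
            for (long j = 0; j < N; j++) {
                pthread_mutex_lock(&bin_lock);   /* csect: mutually exclusive counter update */
                histog[pixel[i][j]] += 1;
                pthread_mutex_unlock(&bin_lock);
            }
        return NULL;
    }

    int main(void)                               /* parfor k := 0 until P - 1 */
    {
        pthread_t t[P];
        for (long k = 0; k < P; k++) pthread_create(&t[k], NULL, histo_part, (void *)k);
        for (long k = 0; k < P; k++) pthread_join(t[k], NULL);
        return 0;
    }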
(Figure: the picture array pixel[0:m − 1, 0:n − 1] divided into p segments of s rows each, with the histogram array histog.)
Also, the updating of each "histog" counter must be done in a mutually exclusive
manner to avoid incorrect results. Hence, the counter update statement must be
enclosed as a critical section. The degree of decomposition influences the degree
of contention in this case. Without memory contention, the potential time com-
plexity of each process is O(sn). Since s = m/p, the decomposition of the problem into
p processes has a potential speedup of p. This speedup is never achievable in
practice, however.
We illustrate the effect of the placement of the process code, picture segments,
and bins by considering the execution of this algorithm on three different multi-
processor architectures, each with p processors. In the first case, each of the p PEs
consists of a processor and its local memory, which is attached to a time shared
bus. In addition, a main memory, which is also attached to the bus, is shared by
all the processors. Each PE contains the process code and its segment of the picture.
The b bins are stored in the main memory. Since each PE contains a segment of
the picture, the processors access the pixels without conflict. However, this
architecture will cause excessive conflicts to the main memory because of con-
current accesses to the bins.
In the second case, the processors share the main memory through a crossbar
switch. The process code and the entire picture elements are in memory. The bins
are distributed across the memory modules. Therefore, in addition to conflicts of
accesses to the process code and pixels, there are also conflicts of accesses to the
bins.
In the third case, each processor of the second case has a private cache. Assume
that the process code is small enough to reside in the cache; hence, we have faster
access to code. Also assume that the cache is not large enough, so that the blocks of
read only pixels are fetched into the cache on demand. The pixels are thus accessed
at slightly faster than main memory speeds. However, the bins are shared writeable
memory locations. Therefore, accesses to them cause excessive “ ping-ponging”
as copies of these bins are bounced from one cache to the other because of refer-
ences for updates.
The effect of ping-ponging is illustrated by accesses to the bin histog[k] in a
cache with a write-back memory update policy. Suppose a remote cache Cⱼ has the
latest copy of bin k, which is now referenced by a local cache Cᵢ. The reference in
Cᵢ results in a miss and causes Cⱼ to update the memory copy of bin k and also in-
validate Cⱼ's copy of bin k. Furthermore, Cᵢ gets the copy of bin k from memory and
increments it. In effect, processor i waits for two memory cycles each time a reference
to a bin results in a miss when another processor has a copy of it. A subsequent
reference by another processor to this bin ping-pongs the bin to that processor's
cache.
If the degree of decomposition is greater than the number of available pro-
cessors, the processing time of the histogramming problem is as slow as the
completion time of the last of the p processes. The memory contention and mutual
exclusion problems could be eliminated if each process has b bins to generate its
local histogram for its segment of the picture in its local memory. The algorithm
below illustrates the modification required to eliminate these problems.
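The modified listing is not reproduced above; the following C sketch follows the idea just described, giving each process its own b local bins so that no locking is needed while scanning, and then merging the local histograms into the global one in a final pass (the merge step is an assumption of the sketch).

    /* Parallel histogramming with private per-process bins (a sketch). */
    #include <pthread.h>

    enum { B = 256, M = 512, N = 512, P = 4 };

    static unsigned char pixel[M][N];
    static long local_histog[P][B];              /* one private histogram per process */
    static long histog[B];

    static void *histo_local(void *arg)
    {
        long k = (long)arg, s = M / P;
        for (long i = k * s; i < (k + 1) * s; i++)
            for (long j = 0; j < N; j++)
                local_histog[k][pixel[i][j]] += 1;   /* no contention, no critical section */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[P];
        for (long k = 0; k < P; k++) pthread_create(&t[k], NULL, histo_local, (void *)k);
        for (long k = 0; k < P; k++) pthread_join(t[k], NULL);
        for (int k = 0; k < P; k++)                  /* merge the local histograms */
            for (int v = 0; v < B; v++)
                histog[v] += local_histog[k][v];
        return 0;
    }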
7.2 Assume a uniprocessor with cache. The main memory consists of 16 modules which can be inter-
leaved in various ways. For each case below, indicate the number of memory cycles required per block
transfer. Also indicate the reliability of each system by an integer k. A system is said to be k-reliable with
respect to memory if it can keep functioning after k modules taken at random are disconnected. In the
table, LSB means the least significant bit and MSB the most significant bit.
Interleaving        Block size        Bus width        Number of memory cycles        Reliability
7.3 The functions S,_,(/) and {,,,(/) from the Delta networks are defined as follows:
and
7.6 Consider a p × m crossbar switch connecting p processors to m memory modules. Assume only
AND gates and OR gates (no wired-OR). Assume also that all variables are available in true
and complemented form.
(a) Estimate the number of gates in the switch, ignoring the decoders and the arbiter. Assume
the data width to be w bits.
(b) Design the decoder and arbiter for the above crossbar. Assume that processor Pᵢ has
priority over Pⱼ if i < j. Estimate the number of gates for this circuit.
7.7 We have considered an 8 × 8 delta network in Figure 7.35. Answer the following questions related
to this interconnection network.
(a) Does the network have a path between any processor and any memory module?
(b) Let (d₂d₁d₀)₂ be the address of a memory module (MM) generated by a processor whose
number is (p₂p₁p₀)₂. Let the control variables at stages 0, 1, and 2 be x₀, x₁, and x₂ respectively. The
convention is
      xᵢ = 0 for a straight connection and xᵢ = 1 for a crossed connection.
The requested MM address is passed through the successive stages to set up the path. Find the logic
equations for x₀, x₁, x₂ as functions of d₀, d₁, d₂ and p₀, p₁, p₂.
(c) Assume that processor zero accesses MM2, processor four accesses MM4, and processor six
accesses MM3. Show the paths for these three requests on Figure 7.35. Do these requests conflict?
7.8 Briefly characterize the multicache coherence problem and describe various methods that have been
suggested to cope with the problem. Comment on the advantages and disadvantages of each method
to preserve the coherence among multiple shared caches used in a multiprocessor system.
7.9 Distinguish among the following operating system configurations for multiprocessor computers.
In each system configuration, name two example multiprocessor computers that have implemented
an operating system similar to the configuration being discussed. Comment on the advantages, design
problems, and shortcomings in each operating system configuration.
(a) Master-slave operating system.
(b) Separate supervisor system per processor.
(c) Floating supervisor system.
7.10 Consider a computer with four processors P₁, P₂, P₃, P₄ and six memory modules M₁,
M₂, ..., M₆. The four processors can be configured as an MIMD machine or as an SIMD machine.
The illustrated memory access patterns are generated by the processors for the computation of six
instructions with the data dependence graph shown in Figure 7.58. Each instruction needs to access
the memory modules in, at most, three consecutive memory cycles. For SIMD mode, only the same
instruction can access different modules simultaneously. For MIMD mode, no such restriction exists.
When two or more processors access the same module in the same cycle, the request of the lower-
numbered processor is granted, and the rest of the requests must wait for a later memory cycle.
(a) For MIMD operation, different instructions can be executed by the four processors at the
same time, subject only to data dependency, What is the average memory bandwidth (words/cycle) used
in the execution of the above program in MIMD mode?
(b) Repeat the same question for using the computer in SIMD mode. The four processors must
execute the same instruction at the same time under SIMD mode.
7.11 Supposing each task is an assignment statement, restructure the following assignment statements
using Bernstein's conditions so that we have maximum parallelism among tasks. Specify which of the
three conditions you are using for the restructuring.
   A = B + C
   C = D + E
   F = G + E
   C = L + M
   M = G + C
Figure 7.58 Program graph and the memory access pattern for either an SIMD computer with 4 PEs or
an MIMD multiprocessor with 4 processors in Problem 7.10. (The access-pattern table gives, for each of
the instructions I₁ through I₆, the memory module accessed by each of the processors P₁ through P₄.)
Write your answer so that, when two assignment statements are put on the same line, it indicates that
they can be processed in parallel. It is easier if you first set up the order of precedence in a precedence graph.
(Note: The statements executed after the restructuring should have the same result as the statements
executed before restructuring.)
7.12 A parallel computation on an n-processor system can be characterized by a pair ⟨P(n), T(n)⟩,
where P(n) is the total number of unit operations to be performed and T(n) is the total execution time
in steps by the system. In a serial computation on a uniprocessor with n = 1, one can write T(1) = P(1),
because each unit operation requires one step to be executed. In general, we have T(n) ≤ P(n) if there
is more than one operation to be performed per step by n processors, where n ≥ 2. Five performance
indices have been suggested by Lee (1980) for comparing a parallel computation with a serial
computation:

   S(n) = T(1)/T(n)                     (the speedup)
   E(n) = T(1)/(n · T(n))               (the efficiency)
   R(n) = P(n)/P(1)                     (the redundancy)
   U(n) = P(n)/(n · T(n))               (the utilization)
   Q(n) = T³(1)/(n · T²(n) · P(n))      (the quality)
(a) Prove that the following relationships hold in all possible comparisons of parallel to serial
computations.
   (1) 1 ≤ S(n) ≤ n
   (2) E(n) = S(n)/n
   (3) U(n) = R(n) · E(n)
   (4) Q(n) = S(n) · E(n)/R(n)
EIGHT
MULTIPROCESSING CONTROL AND
ALGORITHMS
cooperating process. The set of constraints on the ordering of these events con-
stitutes the synchronization required for the cooperating processes. The
synchronization mechanism is used to delay execution of a process in order to
satisfy such constraints.
Two types of synchronization are commonly employed when using shared
variables. These are mutual exclusion and condition synchronization. We recall
that mutual exclusion ensures that a physical or virtual resource is held indivisibly.
Another situation occurs in a set of cooperating processes when a shared data
object is in a state that is inappropriate for executing a given operation. Any
process which attempts such an operation should be delayed until the state of the
data object changes to the desired value as a result of other processes being
executed. This type of synchronization is sometimes called condition synchron-
ization. The mutual-exclusive execution of a critical section, S, whose access is
controlled by a variable gate can be enforced by an entry protocol denoted by
MUTEXBEGIN (gate) and an exit protocol denoted by MUTEXEND (gate).
Alternatively, the effect of the entry and exit protocols can be expressed as csect
gate do S.
There are certain problems associated with implementing the MUTEXBEGIN/
MUTEXEND construct. Execution of the MUTEXBEGIN statement should
detect the status of the critical section. If it is busy, the process attempting to
enter the critical section must wait. This can be done by setting an indicator to
show that a process is currently in the critical section. Execution of the MUTEX-
END statement should reset the status of the critical section to idle and provide
a mechanism to schedule the waiting process to use the critical section (CS).
One implementation is the use of the LOCK and UNLOCK operations to
correspond to MUTEXBEGIN and MUTEXEND respectively. For these,
consider that there is a single gate that each process must pass through to enter
a CS and also leave it. If a process attempting to enter the CS finds the gate un-
locked (open) it locks (closes) it as it enters the CS in one indivisible operation so
that all other processes attempting to enter the CS will find the gate locked. On
completion, the process unlocks the gate and exits from the CS. Assuming that the
variable gate = 0 (1) means that the gate is open (closed), the access to a CS
controlled by the gate can be written as
LOCK (gate)
execute Critical section
UNLOCK (gate)
The LOCK (x) operation may be implemented as follows:
var x: shared integer;
LOCK(x): begin
   var y: integer;
   y ← x;
   while y = 1 do y ← x;      // wait until gate is open //
   x ← 1;                     // set gate to unavailable status //
end
UNLOCK(x): x ← 0;
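For comparison, a C11 sketch of the same gate in which the test and the set are one indivisible operation (which is what the LOCK above requires) can be written with an atomic flag; the names here are illustrative only, not the book's.

    /* A busy-waiting spin lock built on an atomic test-and-set (a sketch). */
    #include <stdatomic.h>

    static atomic_flag gate = ATOMIC_FLAG_INIT;    /* clear = open, set = closed */

    void LOCK(void)
    {
        while (atomic_flag_test_and_set(&gate))
            ;                                      /* busy-wait until the gate opens */
    }

    void UNLOCK(void)
    {
        atomic_flag_clear(&gate);                  /* open the gate */
    }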
An important property of locks is that a process does not relinquish the processor
on which it is executing while it is waiting for a lock held by another process.
Thus, it is able to resume execution very quickly when the lock becomes available.
However, this property may create problems for the error-recovery mechanism of
the system when the processor which is executing the lock fails. The error-recovery
procedure has to be sophisticated enough to ensure that deadlocks are not in-
troduced as a result of the recovery process itself,
Another instruction used to enforce mutual exclusion of access to a shared
variable in memory location m_addr is the compare-and-swap (CAS) instruction.
This instruction is available on the IBM 370/168. A typical syntax of this instruction
uses two additional operands, r_old and r_new, which are processor registers
(CAS r_old, r_new, m_addr). The action of the CAS instruction is defined as
follows:
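The definition itself does not appear above; the following C-style sketch states the semantics implied by the surrounding discussion (compare r_old with the word at m_addr; if they are equal, store r_new there and set z, otherwise copy the memory value into r_old and clear z), with the whole body understood to be a single indivisible operation in hardware.

    /* Semantics of CAS r_old, r_new, m_addr (a sketch; returns the flag z). */
    int CAS(int *r_old, int r_new, int *m_addr)
    {
        /* in the machine the entire IF statement is one indivisible operation */
        if (*r_old == *m_addr) {
            *m_addr = r_new;               /* swap performed */
            return 1;                      /* z = 1: comparison indicated equality */
        } else {
            *r_old = *m_addr;              /* r_old receives the current memory value */
            return 0;                      /* z = 0 */
        }
    }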
Notice that associated with the CAS instruction is a processor flag z. The flag is
set if the comparison indicates equality. Again, the execution of the CAS instruction
(that is, the IF statement) is an indivisible operation. We illustrate the use of the
CAS instruction with a shared singly linked queue data structure (Figure 8.1),
which is accessed concurrently by the two processes P₁ and P₂. The two operations
which can be performed on the queue are ENQUEUE(X) and DEQUEUE.
ENQUEUE(X) adds a node X to the "TAIL" of the queue and DEQUEUE
returns a pointer to the deleted "HEAD" of the queue. HEAD and TAIL are
shared global variables. Assuming that the queue is never empty (for simplicity),
the ENQUEUE(X) primitive for a nonconcurrent system can be described as
Procedure ENQUEUE(X);
   var P: pointer;           // P is local to each invocation //
   begin
      LINK(X) ← Λ;           // terminate last node's link //
      P ← TAIL;
      TAIL ← X;
      LINK(P) ← X;           // attach new node to queue //
   end
Figure 8.1 A shared singly linked queue (HEAD, TAIL) accessed concurrently by processes P₁ and P₂: states (a), (b), and (c) as nodes X and Y are enqueued.
"repeat CAS P, X, TAIL until TAIL = X". The modified ENQUEUE(X) primitive
is shown below:
Procedure ENQUEUE(X);
   var P: pointer;
   begin
      LINK(X) ← Λ;
      P ← TAIL;
      repeat CAS P, X, TAIL until TAIL = X;
      LINK(P) ← X;
   end
The CAS instruction ensures that the logical state (P) of the interrupted
program is maintained on resumption of the interrupted program. Otherwise
it updates the state P to the most recent value of TAIL.
Figure 8.1b shows the outcome of the execution of the primitive by one of the processes,
followed by the completion of the execution of the primitive by the other (Figure 8.1c). The CAS
instruction is more useful than the test-and-set instruction. An extension of the
CAS instruction is the compare double and swap, also available on the IBM 370/168.
There are other variations which enforce mutual exclusion. For example, the
Honeywell 60/66 has the load-accumulator-and-clear-memory-location (LDAC)
instruction.
The LOCK instruction using TAS has a drawback in that processes attempting
to enter critical sections are busy accessing and testing common variables. This
is called busy-wait or spin-lock, which results in performance degradation. The
process cannot normally be context-swapped off its processor while it is waiting.
Hence, the processor is said to be locked out. Such lock-out is only permitted in
supervisor mode. In general, LOCK and UNLOCK primitives are not usually
allowed to be executed in user mode because the user process may be swapped
out while holding a critical section. On the other hand, if the user makes a super-
visor call each time it attempts to access a critical section, the overhead will be
greatly increased. Hence the CAS instruction was provided as an excellent mechan-
ism of letting the user do some synchronization in user mode.
The performance degradation due to spin-locks is two-fold. When a processor
is spinning, it actively consumes memory bandwidth that might otherwise have
been used more constructively. If the spinning period is too long, a processor is
not effectively utilized during that period. A number of methods have been proposed
to reduce the degradation due to spin-locks. The first method is aimed at reducing
the request rate to memory and, hence, the degree of memory conflicts. This is
accomplished by delaying the reissuance of the lock request for an interval T.
Thus, the LOCK (x) primitive, for example, can be modified as
LOCK(x): begin
   y ← TEST-AND-SET(x);
   while y ≠ 0 do
      begin
         PAUSE(T);                 // delay for an interval T before retrying //
         y ← TEST-AND-SET(x);
      end
end
Note that the processor issuing the request may not be released unless T is large
enough. The choice for T depends on the granularity of the resource being re-
quested.
The second method is directed at relieving the processor of performing the
lock access by incorporating a separate mechanism which processes lock requests.
This can be accomplished in one of several ways. For example, the mechanism
can continuously access the lock until it is available, as in the HEP machine.
wake_up(p): if process p is logically blocked (that is, dormant), change its state
to active; else set a wake-up waiting switch (wws) to remember the wake-up
call.
block(p): if process p's wws is set, reset it and continue execution of the process;
else change p's state to dormant.
Figure 8.2 Implementation of the Fetch-and-Add primitive (F&A).
var s: semaphore
section; that is, it acts to acquire permission to enter. The V(s) primitive is the
MUTEXEND and records the termination of a critical section:
full variable informs the consumer of the number of messages that need to be con-
sumed. The concurrent program below illustrates the actions of the producer and
consumer processes. The producer or consumer will be suspended when empty = 0
or full = 0, respectively.
Example 8.1
shared record
   begin
      var p, c: integer;
      var empty, full: semaphore;
      var BUFFER[0:n − 1]: message;
   end
initial empty = n, full = 0, p = 0, c = 0;
cobegin
   Producer: begin
      var m: message;
      cycle
         begin
            Produce a message m;
            P(empty);
            p ← (p + 1) mod n;
            BUFFER[p] ← m;          // place message in buffer //
            V(full);
         end
   end
   Consumer: begin
      var m: message;
      cycle
         begin
            P(full);
            c ← (c + 1) mod n;
            m ← BUFFER[c];          // remove message from buffer //
            V(empty);
            Consume message m;
         end
   end
coend
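For readers who want to run the example, the following C sketch restates Example 8.1 with POSIX semaphores, with sem_wait and sem_post in the roles of P and V; a single producer and consumer, the fixed message count, and the identifier names are assumptions of the sketch, and error checking is omitted.

    /* Example 8.1 restated with POSIX semaphores (a sketch). */
    #include <pthread.h>
    #include <semaphore.h>

    #define NSLOTS 8
    typedef int message;

    static message BUFFER[NSLOTS];
    static sem_t empty_slots, full_slots;          /* the semaphores empty and full */
    static int p = 0, c = 0;

    static void *producer(void *unused)
    {
        for (message m = 0; m < 100; m++) {
            sem_wait(&empty_slots);                /* P(empty) */
            p = (p + 1) % NSLOTS;
            BUFFER[p] = m;                         /* place message in buffer */
            sem_post(&full_slots);                 /* V(full) */
        }
        return NULL;
    }

    static void *consumer(void *unused)
    {
        for (int k = 0; k < 100; k++) {
            sem_wait(&full_slots);                 /* P(full) */
            c = (c + 1) % NSLOTS;
            message m = BUFFER[c];                 /* remove message from buffer */
            sem_post(&empty_slots);                /* V(empty) */
            (void)m;                               /* consume message m */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tp, tc;
        sem_init(&empty_slots, 0, NSLOTS);         /* initial empty = n */
        sem_init(&full_slots, 0, 0);               /* initial full = 0 */
        pthread_create(&tp, NULL, producer, NULL);
        pthread_create(&tc, NULL, consumer, NULL);
        pthread_join(tp, NULL);
        pthread_join(tc, NULL);
        return 0;
    }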
The P and V operations may be extended for ease of problem formulation and
clarity of solutions. The extended primitives PE and VE developed by Agerwala
(1977) are indivisible and each operates on a set of semaphores which must be
initialized to nonnegative values.
There is no association between sⱼ and s̄ⱼ; the barred symbol s̄ⱼ is used for convenience
to represent the semaphore sⱼ for which j > n. The following examples are used to
illustrate the application of the extended primitives.
cobegin
   Process X: begin PE(r_R, r_P); use resources R and P; VE(r_R, r_P); end
   Process Y: begin PE(r_R, r_T); use resources R and T; VE(r_R, r_T); end
   Process Z: begin PE(r_P, r_T); use resources P and T; VE(r_P, r_T); end
coend
based on the priorities. A request by a process is not honored until all higher-
priority requests have been granted. The resource is used nonpreemptively:
Note that the requesting process cannot be blocked on PE(sᵢ) since a VE(sᵢ)
was executed earlier to register the request. This example may be used in servicing
prioritized interrupts. In this case the processes represent the interrupts
and R represents the processor which services the interrupts.
var s, u, R: semaphore;
initial s = u = 0;
initial R = 1;

Supervisor: begin            User: begin
   VE(s);                       VE(u);
   PE(R);                       PE(R, s);
   PE(s);                       PE(u);
   Use Resource;                Use Resource;
   VE(R);                       VE(R);
end                          end
Notice that the constructs differ mainly in the second PE statements. Since the
supervisor process is of a higher priority than the user process, it only checks to
see if the resource is available [PE(R)], whereas the user process also checks to
see if there is an outstanding request from the supervisor [PE(R, s)]. Since we are
considering simultaneous execution of the user and supervisor codes, the execution
of the PE(R, s) statement will find s = 1, which was set by the VE(s)} operation in
the supervisor process. Hence, the user process will be blocked until the resource
is released by the supervisor.
Although semaphores can be implemented using locks, they are more
commonly accessed by system calls to the supervisor. The supervisor maintains two
sets of lists or queues: blocked and ready. Descriptors for processes that are
blocked on a semaphore are added to a block queue associated with that sema-
phore. For the generalized P and V, the set of blocked queues may be quite complex.
However, execution of a PE or VE operation causes a trap to a supervisor routine
which completes the operation. The ready list contains descriptors of processes
that are ready to be assigned to a processor for execution. In a multiprocessor,
with master-slave operating system, a single processor may be responsible for
maintaining the ready list and assigning processes to the slave processors. The
ready list may be shared in a multiprocessor with a distributed supervisor. In
this case, the ready list may be accessed concurrently. Therefore, mutual exclusion
must be ensured and can be accomplished by spin-locks, since enqueue and
dequeue operations are fast on the ready list. Moreover, a processor that is at-
tempting to access the ready list cannot execute any other process.
Semaphores are quite general and can be used to program almost any kind of
synchronization. However, the use of the P and V primitives in a parallel algorithm
makes the algorithm rather unstructured and prone to error. For example, omitting
a P or V, or accidentally invoking a P on one semaphore and a V on another, can
have disastrous effects, since mutual exclusion would no longer be ensured.
Also, when using semaphores, a programmer can forget to include in critical
sections all statements that reference shared modifiable objects. This, too,
could cause errors in execution. Another problem with using semaphores is that
both condition synchronization and mutual exclusion are implemented using the
same pair of primitives. This makes it difficult to identify the purpose of a given
P or V operation without a detailed trace of other effects on the semaphore.
The variable v is used to name a given resource. The global state of a system of
processes is determined by the values of the shared variables in v and the program
counters of the individual processes. The variables in v may only be accessed within
CCS statements that name v. A CCS statement is of the form

   csect v do await C: S

where C is a boolean expression and S is a statement list. Note that variables local
to the executing process may also appear in the CCS statement.
A CCS statement delays the executing process until the condition C is true;
S is then executed. The evaluation of C and the execution of S are uninterruptible
by other CCS statements that name the same resource. Thus, C is guaranteed to
be true when execution of S begins. Mutual exclusion is provided by guaranteeing
that executions of different CCS statements, each naming the same resource, are
not overlapped. Condition synchronization is provided by the explicit boolean con-
ditions in CCS statements.
We illustrate the use of the conditional critical sections by two applications.
The first example is a solution to the producer-consumer problem. Assume that
the two classes of processes (producers and consumers) communicate via a
bounded circular buffer as in Figure 8.4. Access to this buffer must be mutually
exclusive. Seven shared variables which are associated with the critical section v
are used to indicate the global status of the system of processes.
Example 8.4 The variables p and c are as in Example 8.1. Variables empty and
full are also integer variables, denoting the number of slots empty or occupied,
respectively. Variables np and nc indicate the number of producers and con-
sumers, respectively, which are working on the buffer.
The second example on the use of the conditional critical section is the solution
of the reader-and-writer problem. Improper reading and writing of shared variables
is the classic cause of difficulty in finding operating system bugs. The basic problem
is that two sets of processes executing concurrently may interleave read and write
operations in such a way that improper decisions are made and the shared variables
are left in an improper state. This kind of bug is insidious, for it may show up
only infrequently, and then the symptoms occur rarely or never repeat since they
depend on a particular concurrency relationship.
In the reader-and-writer problem, there are reader and writer processes which
share a common data segment. Any number of readers may access the segment
simultaneously, but a writer must have exclusive access to it. To prevent a writer
Example 8.5
associate shared data structures with their critical sections. By so doing, the data
structures are no longer shared or global, but local or hidden within the body of a
monitor. In addition, process functions no longer contain critical sections. Instead,
the critical sections are centralized and protected within the monitor functions.
The restricted access to shared data structures provided by a monitor is even more
attractive if it can be checked by a compiler. Many high-level languages today
provide the means for controlling the scope of variable names.
Monitors provide support for processes to form a multiprogramming system.
While a process is active in the sense that it performs a job, a monitor is passive in
the sense that it only executes when called by a process. A monitor is necessary
only when two or more processes communicate, to ensure that they communicate
properly. Figure 8.6 is a representation of two processes communicating through
shared data encapsulated by a monitor.
A monitor consists of a set of permanent variables used to store the resource’s
state, and some procedures, which implement operations on the resource. A
monitor also has initialization code for the permanent variables. This code is
executed once before any procedure body is executed. The values of the permanent
variables are retained between activations of monitor procedures and may be
accessed only from within the monitor. Monitor procedures can have parameters
and local variables, each of which takes on new values for each procedure activa-
tion. The structure of a monitor with name -nname and procedures OPI, ..., OPN
is shown below.
mname monitor;
var declarations of permanent variables
procedure OP1 (parameters)
var declarations of variables local to OP1
begin
code to implement OP1
end
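As an informal illustration (not the notation used in the text), the skeleton above can be mapped onto C by keeping the permanent variables at file scope, running the initialization code once, and executing every procedure body under a single lock so that only one activation is inside the monitor at a time; the counter resource managed here is purely illustrative.

    /* A monitor-like module in C (a sketch). */
    #include <pthread.h>

    static pthread_mutex_t monitor_lock = PTHREAD_MUTEX_INITIALIZER;
    static int units = 0;                    /* a permanent variable of the monitor */

    void mname_init(void)                    /* initialization code, executed once */
    {
        units = 10;
    }

    int mname_op1(int request)               /* procedure OP1 */
    {
        pthread_mutex_lock(&monitor_lock);   /* only one activation at a time */
        int granted = (request <= units);
        if (granted)
            units -= request;
        pthread_mutex_unlock(&monitor_lock);
        return granted;
    }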
When a monitor function is called but is blocked from handling the request
immediately, it may take several actions. It may immediately return a blocked
indication, it may loop or busy-wait until the request can be handled, or it may
place the process on a waiting queue for the resource requested. In the latter case,
the waiting queue must be a part of the monitor data structure. In real-time systems,
it is sometimes best to return a blocked indication and let the process decide
whether to try again later or give up.
An operating system contains a kernel or nucleus which contains a few
special processes to handle initialization and interrupts and a basic monitor to
support the concept of a process. The basic monitor includes functions to switch
environments between processes and to create, spawn, or fork a new process. The
kernel is also one part of an operating system that executes in the privileged state.
Besides the kernel, an operating system consists of many monitors and a few
processes. The processes include several kinds of I/O processes that are activated
as needed and at least one active process to look for new jobs and create user
processes for them. All monitors are part of the operating system and form the
bulk of the system. They are used to manage the resources of the system. For
example, monitors transmit messages between processes, control competing
processes, enforce access rights, and communicate with I/O processes.
Example 8.6 Consider the three concurrent processes P₁, P₂, and P₃ sharing
four distinct resources controlled by the four semaphores S₁, S₂, S₃, and S₄.
All semaphores have an initial value 1. P-V primitives are used to specify the
resource-request patterns shown in Figure 8.7a. Assume one unit of each
resource type.
We use a directed graph in Figure 8.7b to show the possible resource-
allocation ordering. The nodes correspond to resource semaphores, one per
type. An edge (labeled Pₖ) from Sᵢ to Sⱼ means that resource Sᵢ has been allo-
cated to process Pₖ and Pₖ is requesting resource Sⱼ. Following this rule, we
Figure 8.7 Concurrent processes for the deadlock study in Example 8.6.
so that no precess among the three can proceed. This situation is called system
deadlock or “deadly embrace”.
Suppose we modify the request pattern in one of the processes by exchanging the
order of two of its requests. A new resource-allocation graph results in
Figure 8.7c, in which the direction of the corresponding edge has been reversed.
Since there are no loops in this modified graph, no circular wait will be possible.
Thus deadlock can be avoided. Because of data dependencies or other reasons,
the change of request order may not be permitted in general. Thus better
techniques are needed to avoid deadlock. We shall show some of these tech-
niques later.
In general, a deadlock situation may occur if one or more of the following
conditions are in effect:
allocate resources depending upon the current state of the system. These methods
lead to better resource utilization.
One basic model that is assumed consists of a sequence of task steps during
each of which the resource usage of the task remains constant. The execution of a
task step first involves the acquisition of those resources needed by the given
task step but not passed on by the previous task step. Next follows a period of
execution during which the resource requirement remains invariant. Finally, at
the completion of execution, all those resources not needed by the subsequent
task step are released and returned to a pool of available resources.
Before discussing the avoidance techniques, it is convenient to describe another
model for the resource-allocation system in a multiprocessor. A resource-allocation
system (RAS) includes a set of n independent processes P₁, P₂, ..., Pₙ (n ≥ 1) and a
set of m different types of resources R₁, R₂, ..., Rₘ (m ≥ 1), where each Rⱼ has a
fixed number of units cⱼ. The RAS also includes a scheduler that allocates the
resources to the processes according to certain rules fulfilling some specified
criteria.
The system state of a RAS is defined by (W, A, f), where W = (w₁, w₂, ..., wₙ)
is the request matrix, which has the dimensions n by m. The entry Wᵢⱼ = wᵢ(j) is the
maximum number of additional units of resource Rⱼ that the process Pᵢ will need
at one time to complete its task; wᵢ is the want vector for process Pᵢ. A =
(a₁, a₂, ..., aₙ) is the allocation matrix (n × m). The entry Aᵢⱼ = aᵢ(j) is the number
of units of resource Rⱼ allocated to process Pᵢ; aᵢ is the allocation vector for process
Pᵢ. The vector f = (f₁, f₂, ..., fₘ) is the free (available) resource vector and c =
(c₁, c₂, ..., cₘ) is the system capacity vector. Since fⱼ ≤ cⱼ is the number of available
units of resource type Rⱼ, we can find fⱼ to be

   fⱼ = cⱼ − Σᵢ₌₁ⁿ aᵢ(j)                                      (8.1)

That is, the sum of the resources allocated and those available of type Rⱼ must be
equal to the total number of units of that type in the system.
When A = 0, the system is in the initial state. In this state, D = (d₁, d₂, ..., dₙ)
= W is called the demand matrix and c = f, where dᵢ is the demand vector for
process Pᵢ.
Certain basic assumptions are made regarding avoidance methods. Before a
process enters the system, it is required to specify for each resource the maximum
number of resource units it will ever need. There is no preemption and a process
releases a resource after it has completed its task. Moreover, dᵢ ≤ c for all i.
A sequence of task steps P_e(1) P_e(2) ··· P_e(k) is called a terminating sequence for
(W, A, f), where e(j) is the index of the process in the jth place, if

   w_e(1) ≤ f                                                (8.2)

and

   w_e(i) ≤ f + Σ_{j=1}^{i−1} a_e(j),     for 1 < i ≤ k
Note that each occurrence of a process in the sequence goes through the following
cycle: request resource, use resource, release resource. Hence, for a process to be run,
its want vector must not be greater than the free resource vector plus the "released"
resource vectors of the processes that precede it in the sequence. A terminating
sequence is complete if all processes are in the sequence.
The system state (W, A, f) is safe if there is a complete terminating sequence
for it; that is, if there is a way to allocate the resources claimed by the processes
so that all of them can finish their tasks. The safeness of a state can be expressed as
a safe request matrix S, where Sᵢⱼ is the maximum number of units of resource Rⱼ
that can be granted safely if process Pᵢ requests them.
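A small C sketch of a safety test that follows this definition directly is given below: it repeatedly looks for a process whose want vector can be met from the currently free resources, lets it run to completion, and returns its allocation to the free pool. The fixed dimensions are illustrative only.

    /* Safety test for a state (W, A, f) by constructing a terminating sequence (a sketch). */
    #include <stdbool.h>

    enum { NPROC = 3, NRES = 2 };

    bool state_is_safe(int W[NPROC][NRES], int A[NPROC][NRES], int f[NRES])
    {
        int  avail[NRES];
        bool done[NPROC] = { false };

        for (int j = 0; j < NRES; j++) avail[j] = f[j];

        for (int finished = 0; finished < NPROC; ) {
            int i;
            for (i = 0; i < NPROC; i++) {
                if (done[i]) continue;
                bool fits = true;
                for (int j = 0; j < NRES; j++)
                    if (W[i][j] > avail[j]) { fits = false; break; }
                if (fits) break;                   /* P_i can be appended to the sequence */
            }
            if (i == NPROC) return false;          /* no process can be appended: unsafe  */
            for (int j = 0; j < NRES; j++)
                avail[j] += A[i][j];               /* P_i finishes and releases its units  */
            done[i] = true;
            finished++;
        }
        return true;                               /* a complete terminating sequence exists */
    }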
Example 8.7 If we restrict each process to making a single request for a finite
number of units, and the state of the system (W, A, f) for three processes and two
resource types is
   W = (request matrix),   A = (allocation matrix),   f = (1, 1)
then the system is in a safe state. Notice that from Eq. 8.1, c = (2, 2), and
P₁P₂P₃ is a complete terminating sequence. If P₃ requests only one unit of
R₁, the system will be in a safe state, since in that case there is still a terminating
sequence containing P₃. Therefore, S₃₁ = 1. Similarly, if P₃ requests only
one unit of R₂, S₃₂ = 1. However, if P₃ requires at the same time one unit of
R₁ and one unit of R₂, the request cannot be granted safely.
This example shows that the matrix S can be used only for single requests. If
a process requires more than one resource, a rule stating the order in which the
resources will be requested must be defined. In general, the computation of S may
be time consuming. However, it is possible to compute S concurrently with process
execution in a multiprocessor system. This reduces the overhead significantly.
The "while" statement searches for an index i in ROWS such that aᵢ ≤ v. If none
is found, a nonempty ROWS set indicates the existence of a deadlock; otherwise,
there is no deadlock. An important point to remember is that a program bug can
cause a deadlock even in situations where deadlocks theoretically cannot occur.
An ultimate time-out can be a simple defensive check on the correctness of the
system as well as a way to prevent indefinite deadlocks.
Given that a deadlock has occurred, perhaps the simplest approach to recover-
ing from it would involve aborting each of the deadlocked tasks or, less drastically,
aborting them in some sequence until sufficient resources are released to remove
the deadlocks in the set of remaining tasks. Algorithms could also be designed that
search for a minimum-sized set of tasks which, if aborted, would remove the
deadlocks. A more general technique involves the assignment of a fixed cost bⱼ to
the removal (forced preemption) of a resource of type Rⱼ from a deadlocked task
that is being aborted.
occurrences of errors but are intended to limit their incidence on the protected
objects. An addressing mechanism that decomposes the space of objects into
protected domains is one scheme that makes the confinement and the non-
propagation of errors easier. A domain is the set of objects that may be directly
accessed by the process to which authorization is granted.
One common set of objects that is protected in a computer system is the set
of memory locations. It is imperative to protect the address space of one task from
the other by enforcing a separation of address space among different tasks.
The protection of shared memory is classified into the protection of the physical
and logical address spaces.
The protection of physical address space could be accomplished by partitioning
the physical memory into nonoverlapping blocks (page frames), each of which is
assigned a lock. Each process has a key (often called an access key) as part of its
process-identification word. Access to a block of memory by a process is granted
if the process’ access key fits the lock of the block. In fact, any process with a match-
ing key can have access to the block. Hence, this scheme is not an effective protection
mechanism if the blocks are shared by several processes.
Memory protection on the virtual address space is more effective. In systems
that use base or relocation registers for mapping the virtual address to the
physical address, protection is accomplished with bounds registers. Bounds registers
specify the lower (base address) and upper bounds of the address space. An access
by a process to memory outside the predefined bounds is trapped by the system
as an access violation. This technique assumes that the address space for the process
is in contiguous memory locations. The protection of the address space is further
ensured since systems do not permit the modification of the bounds register by a
user process. Sharing of address space by more than two concurrent processes is
still difficult since the address space of each process is contiguous and the bounds
registers perform linear mapping of addresses.
Another method for memory protection is for each task to have a segment
table (ST), which consists of entries that include the set of access rights and the
base address to pages or segments of the task. During the execution of a task, it
uses the segment table base register (STBR) to obtain access by authenticat-
ing the privileges of the processes, as shown in Figure 8.8. If the access-code
field (AKR) of the virtual address matches the permitted access rights for the
segment (acc) in STE(s), the segment is accessed. When a task switch occurs, the
STBR is modified to point to the ST of the new task.
In process systems, there is a classic distinction between user privileges and
supervisor privileges. The protection of supervisor processes from user processes
is enforced partly by a hardware mechanism which defines two operating modes:
supervisor mode and user mode. This can be accomplished by a single bit in the
processor’s control register. Generally, there are machine instructions, called
privileged instructions, which are not executable by user processes. These instruc-
tions can be considered the set of objects S which needs to be protected from the
user that attempts execute access. The user domain in this case lacks all rights in
the set R = {r | r = (s, execute), s ∈ S}. The protection is enforced by trapping the
(Figure 8.8: the access key register (AKR), the segment number and word index fields of the virtual address, the segment table base register (STBR), and the segment table entry STE(s) holding the access rights acc and the segment base address.)
user process that attempts to use r ∈ R. The user process can request to use a
system procedure that contains privileged instructions by migrating from the
user domain to the supervisor domain, where the supervisor can access r € R on
behalf of the user process.
The multilevel protection scheme as implemented in Multics can be used
effectively in multiprocessors. In this scheme, which is often called the layered
protection system, the basic unit of protection is a segment. Segments are grouped
into a set of n levels or classes. The layering scheme results in a nested ring structure
consisting of n rings, as shown in Figure 8.9. Each level or class of segments is
assigned to a distinct ring. Therefore, the implementation of the virtual address
and the ST entry will have an extra field for representing the ring number of the
segment.
Figure 8.9 Layered protection scheme in Multics and segment table entries.
The access capabilities of ring rᵢ are a subset of those of ring rⱼ whenever rᵢ > rⱼ, for
rᵢ, rⱼ ∈ {0, 1, ..., n − 1}. Access control is performed between classes and not
within a class. Hence, one segment can reference another segment without a
validation check if both segments are in the same class. However, the crossing of
a ring boundary results in an access fault, which invokes the operating system to
perform a validation check. If the crossing is from a ring rᵢ to ring rᵢ + 1, then the
access is permitted. If the crossing is from ring rᵢ to ring rᵢ − 1, the operating
system validates the permission. To call an inner ring, only certain entry points are
permitted. These entry points are called gates. The segment which makes the call
must present a valid key to match one of the locks at the gate. The validation
process is performed by a gatekeeper process. The list of entry points corresponding
to the set of locks is called a gate list.
In this system, segments that belong to a process can, for example, be assigned
to a given ring. If the segments are shared by multiple processes, the disjointness
Figure 8.10 Domains, capabilities, and objects. (Each process Pᵢ executes in a domain Dᵢ with a local memory (LM) and a capability list (C list) referring to objects such as procedure segments.)
(Figure: a capability in a C list, containing type, MOT index, and unique-name fields, points to an entry in the master object table (MOT) that holds the object's segment address, size, and unique name, e.g., 4339963.)
The MOT entry pointed to by the capability contains the absolute address of the
object and its unique identifier. The MOT concept makes addressing of the object
easy, and a relocation of the object needs only the updating of an address in the
MOT, even if this object is shared between several processes. It also makes the
access slow.
In general, protection can be applied on an object or a path to the object. For
example, the access-matrix concept applies the protection on the path, while the
entry in the segment table applies it on the object. Protection placed on the access
path to objects generally requires less overhead than protection placed on the
objects. However, in some cases, both methods are required.
Capabilities have an advantage over privilege-checking in that the protection
check is performed at the beginning of the object name interpretation without
leaving the execution environment. Hence the error is confined to the execution
environment. If a process refers to an object through C lists, it is impossible to
name any object to which the process has no access rights. However, it can become
wasteful in storage space and the overhead in loading and saving a C list upon
Transition        Event
1                 Activate process
2                 Run process
3                 Preempt process
4                 Block process
5                 Wakeup process
6                 Terminate process

Figure 8.12 States of a process and their state transitions.
The ready list can either be local or global. A local ready list may be associated
with each multiprogrammed processor which has a local memory. Thus a process,
once activated, may be bound to a processor. The local ready list reduces the
access time of the list and, hence, the overhead encountered by the dispatcher.
However, the local ready list concept discourages process migration. Moreover,
under light system load, the processor utilization may not be equally distributed
among all processors. To permit process migration, a global ready list which
resides in the shared memory may be used. This has the disadvantage of requiring
more overhead in saving and restoring process states by the process scheduler.
However, the standard deviation of the processor utilization is small.
The general objectives of many theoretical scheduling algorithms are to
develop processor assignments and scheduling techniques that use minimum
numbers of processors to execute parallel programs in the least time. In addition,
some develop algorithms for processor assignment to minimize the execution time
of the parallel program when processors are available.
There are basically two types of models of processor scheduling, deterministic
and nondeterministic. In deterministic models, all the information required to
express the characteristics of the problem is known before a solution to the
problem, that is, a schedule, is attempted. Such characteristics are the execution
time of each task and the relationship between the tasks in the system. The objective
of the resultant schedules is to optimize one or more of the evaluation criteria. For
example, in deterministic models, the execution time of each process can either be
interpreted as the maximum processing time or as the expected processing time.
In the former case, the time to complete the schedule would be considered
the maximum time to complete the system of processes, and in the latter
case, the length of the schedule represents a rough estimate of the mean length
of the computation. The motivation for this objective is that, in many cases,
a poor schedule can lead to an unacceptable response time or utilization of
system resources.
Deterministic models are not very realistic and do not take into consideration
the irregular and unpredictable demands made on the multiprocessor system.
Hence, stochastic models are often formulated to study the dynamic-scheduling
techniques that take place in the system. In stochastic models, the execution time
of a process is a random variable t with a given cumulative distribution function
(cdf) F.
Processor scheduling implies that processes or tasks are to be assigned to a
particular processor for execution at a particular time. Since many tasks can be
candidates for execution, it is necessary to represent the collection of tasks in a
manner which conveniently represents the relationships (if any) among the tasks.
Generally, we will refer to a set of related processes as a task system or a job. A
job, which consists of a set of processes, is represented as a precedence graph, as
shown in Figure 8.13. The nodes in the graph are tasks which may represent
independent operations or parts of a single program which are related to each
other in time. The collection of nodes represents a set of processes T = {T_1, ..., T_n},
and the directed edge between nodes implies that a partial ordering or precedence
relation < exists between the processes. Therefore, if T_i < T_j, process T_i must be
completed before T_j can be initiated. Processes with no predecessors are called
initial processes (e.g., T_1), and those with no successors are called final processes
(e.g., T_10). The individual nodes within a graph can be related to each other in a
number of ways.
For example, it is possible for all processes in a graph to be independent of
each other. In this case, there is no precedence relation or partial ordering between
the processes and all the processes can be scheduled concurrently, provided there
are enough processors available. The width of the task graph G, denoted by
width(G), is the maximum size of any independent subset of processes. In Figure
8.13, for example, the width of the graph is 3.
Associated with each node is a second attribute which refers to the time required
by a hypothetical processor to execute the code represented by the node. Sometimes,
this attribute is called the weight of the node. In a deterministic model, this
attribute is a constant for each node, whereas, in a stochastic model, it is a random
variable with a mean and standard deviation or a known distribution.
Given a computation graph and a multiprocessor system with p processors, a
task assignment or a schedule must be developed such that it gives a description of
the processes to be run and in what order as a function of time. The schedule must
not violate any of the precedence relationships or the requirement that no more
than one processor can be assigned to a task at any time.
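As a small illustration of these two constraints, the sketch below checks a candidate schedule against a precedence relation; the data layout (each task mapped to one processor, a start time, and a finish time, plus a set of precedence pairs) is an assumption made purely for the example, and the one-processor-per-task requirement is implicit in that layout.

def schedule_is_valid(schedule, precedence):
    # schedule: dict mapping task -> (processor, start_time, finish_time)
    # precedence: set of pairs (a, b) meaning task a must finish before b starts
    # 1. No two tasks may overlap in time on the same processor.
    by_proc = {}
    for task, (proc, start, finish) in schedule.items():
        by_proc.setdefault(proc, []).append((start, finish))
    for intervals in by_proc.values():
        intervals.sort()
        for (s1, f1), (s2, f2) in zip(intervals, intervals[1:]):
            if s2 < f1:
                return False
    # 2. Every precedence relation a < b must be respected.
    for a, b in precedence:
        if schedule[a][2] > schedule[b][1]:
            return False
    return True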
In a multiprocessor system, one may associate a process descriptor node (PDN)
with each executable and active process. This descriptor may consist of a number of
fields, as shown in Figure 8.14. The parent field is a pointer to the set of processes
which initiated the process, the child field points to the set of processes to be
initiated by the current process and the register state defines the register values of
the given process. The kernel of the operating system uses the concept of PDNs
to monitor the status of the processes.
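A rough Python rendering of such a descriptor is sketched below; the field names follow Figure 8.14, and everything else (types, defaults) is an illustrative assumption.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PDN:
    # Process descriptor node (sketch); field names follow Figure 8.14.
    process_id: int
    priority: int
    parents: List["PDN"] = field(default_factory=list)    # processes that initiated this one
    children: List["PDN"] = field(default_factory=list)   # processes initiated by this one
    delay_time: float = 0.0
    status: str = "ready"                                  # e.g. ready, run, blocked, terminated
    register_state: dict = field(default_factory=dict)
    size: int = 0
    segment_table: Optional[object] = None                 # pointer to ST(P)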
In some systems, when processes are created, they exist as unrelated units,
independent of each other. In other systems, the order of creation is remembered
and a parent-child relationship is maintained between one process and the new
process it creates. Both approaches have advantages and disadvantages. Typically,
a child process is limited to using only those resources owned by its parent and is
deallocated if its parent terminates. In most general-purpose systems, when a
process is destroyed, its process image is returned to a pool of unallocated memory.
However, in many dedicated or real-time systems, processes are never destroyed.
Instead, they are created at compile time or initialization time and run forever,
even at times when there is no work to do.
During the execution of a process on a processor, external stimuli such as
interrupts may arrive and require urgent service because of input or output device
constraints. If the interruption and subsequent resumption of the process in
execution is permitted before its termination, preemptive scheduling is used. If
the process, once assigned, must run to completion before the processor can be
reassigned, nonpreemptive scheduling is used.
[Figure 8.14: fields of a process descriptor node (PDN): process identification, priority, parent, delay time, process status, register state, size of process, pointer to the segment table ST(P), and children.]
Figure 8.15 Task schedule in Gantt chart form.
Figure 8.16 Task schedule in chart form, using p = 2 processors. (Courtesy of ACM Computing Surveys, Gonzalez, Sept. 1977.)
The mean utilization U_p of the p-processor system in the case of Figure 8.16
is U_p = (30 − 3)/30 = 0.9 for p = 2. The reader can easily show that by increasing
the number of processors to 3, the speedup does not increase. In fact, the utilization
reduces to 0.6. Hence, the execution of the process graph in Figure 8.16a is most
cost-effective on a two-processor system. The rationale behind the minimization of
finishing or completion time is that system throughput can be maximized if the
total computation time of each set of processes is minimized. Throughput is
defined as the number of process sets processed per unit of time.
There are at least two reasons for minimizing the number of processors
required to process a process system. The first and most obvious is cost. The
second reason is the processor utilization. If the number of processors required
to execute a set of processes in a given time is less than the total number of pro-
cessors available, then the remaining processors can be used as backup processors
for increased reliability and as background processors for noncritical com-
putations.
A key issue in the study of processor scheduling is the amount of overhead or
computation time needed to locate a suitable schedule. A scheduling algorithm is
a procedure that produces a schedule for every given set of processes. An efficient
scheduling algorithm is one that can locate a suitable schedule in an amount of
time that is bounded in the length of the input by some polynomial. Construction
of optimal schedules is NP-complete in many cases. NP-complete implies that an
optimal solution may be very difficult to compute in the worst possible input
case. However, construction of suitable schedules, that is, computing a reasonable
answer for the typical input case, is not NP-complete. Therefore suitable schedules
can be obtained for concurrent processes.
In this subsection, we examine deterministic schedules which can be used to
optimize measures of performance. Unless stated explicitly, we assume a scheduling
environment which consists of a number of identical processors, a set of processes
with equal or unequal execution times and a (possibly empty) precedence order.
First we consider preemptive schedules using two processors.
In order to understand the preemptive schedule (PS) on p processors, we
define process graphs with mutually commensurable node weights. A set of nodes
is said to be mutually commensurable if there exists a t such that each node weight
is an integer multiple of t. In a preemptive schedule, a processor may be pre-
empted from an executing process if such an action results in an improved measure
of performance. The PS algorithms are due to Muntz and Coffman (1966).
Assume that the process graph consists of n independent processes with
weights (process durations or execution times) of t_1, t_2, ..., t_n and p processors.
The optimal PS has a completion time of
ω_opt = max{ max_{1≤i≤n} t_i, (1/p) Σ_{i=1}^{n} t_i }
That is, the optimal PS length cannot be less than the larger of the longest process or the
sum of the execution times divided by the number of processors.
For their optimal algorithm, the set of nodes of unit weight in a graph are
partitioned into a sequence of disjoint subsets such that all nodes in a subset are
independent. All nodes in the same subset or at the same level are candidates for
simultaneous execution or group scheduling. In a graph of N subsets or levels,
the terminal node occupies the first level exclusively. Those nodes which may be
executed during the unit time period preceding the execution of the terminal node
occupy the second level, and so on. The initial or entrance node in the graph
occupies the Nth level. Such an assignment of levels generates what are called
precedence partitions.
In particular, the assignment procedure outlined above corresponds to the
latest precedence partitions. That is, the assignment of nodes to levels is done in a
manner which defers process initiation to the latest possible time without increasing
the minimum completion time. Such a schedule is called the latest-scheduling
strategy. This strategy assumes that the number of processors available is greater
than or equal to the maximum number of processes at any level (width of G). This
strategy may be contrasted with the earliest-scheduling strategy, which schedules
a process as soon as a processor is available and the precedence constraints have
been satisfied. Note that the earliest strategy produces earliest-precedence parti-
tions.
For any arbitrary graph G, a precedence relation will exist between the subsets
of the latest strategy due to the precedence which exists between the nodes in the
original graph. A PS can be constructed for graph G by first scheduling the
highest-numbered subset, then the subset at the next lower level, and so on. Note
that when a subset consists of only one node, a node from the next lower subset is
moved up if it does not violate the precedence constraints of the original graph.
If each of the subsets is scheduled optimally, a subset schedule results. For two
processors and equally weighted nodes, an optimal subset schedule for G is an
optimal PS for G.
This result is extended to the case of graphs having mutually commensurable
node weights. In order to generate the optimal result, it is necessary to convert
graph G into another graph G_t in which all nodes have equal weights. This is done
by taking a node of weight t_i and creating a sequence of n_i nodes such that t_i = n_i·t,
as illustrated in Figure 8.17. Note that the integrity of the original graph must be
maintained. It can then be shown that an optimal subset schedule for G_t is an
optimal PS for G, for p = 2 processors.
In this approach, one must note whether the number of processes at any level
is even or odd. If it is even, then all processes at that level can be executed in the
minimum amount of time with no idle time for either of the two processors. If
the number of processes is not a multiple of two, then the last three processes to
be scheduled at that level can be executed in no less than 1½ units, since all pro-
cesses in G_t are of unit duration. By using the form shown in Figure 8.18, three
processes in a given level can be executed in minimum time without processor idle
time. Since scheduling in this manner ensures that no processor is idle, the subset
sequence can be seen to generate a minimal-length PS. An example of the optimal
PS algorithm is shown in Figure 8.19, which also gives the optimal subset sequence
for G.
The optimal results derived above can be extended to the case in which any
number of processors is allowed when the computation graph is a rooted tree
and the node weights t_i are mutually commensurable. A rooted tree is one in which
each node has at most one successor, with the exception of the root or terminal
node, which has no successors. We discuss below some techniques for nonpre-
emptive schedules.
Recall that, in nonpreemptive or basic schedules, a processor assigned to a
process is dedicated to that process until it is completed. The initial investigations
discussed here develop optimal nonpreemptive two-processor schedules for
arbitrary process orderings in which all processes are of unit duration. A particular
Figure 8.18 Minimum-time execution format for three unit tasks with two processors.
t_min = α_max
The optimal solutions by Hu are limited to rooted trees. Using the labeling
procedure described above, one can obtain an optimal schedule for p processors
by processing a tree of unit-length processes in the following manner:
Figure 8.20 Illustration of the Coffman and Graham algorithm. (a) Task graph with reassigned subscripts and list L*; (b) optimal schedule. (Courtesy ACM Computing Surveys, Gonzalez, Sept. 1977.)
1. Schedule first the p (or fewer) nodes with the highest numbered label, i.e., the
   starting nodes. If the number of starting nodes is greater than p, choose p nodes
   whose α_i is greater than or equal to the α_i of those not chosen. In case of a tie,
   the choice is arbitrary.
2. Delete the p processed nodes from the graph. Let the term "starting node"
   now refer to a node with no predecessors.
3. Repeat steps 1 and 2 for the remainder of the graph.
Figure 8.21 Illustration of Hu's optimal algorithm. (a) Rooted tree labeled according to Hu's procedure;
(b) optimal schedule for three processors. (Courtesy ACM Computing Surveys, Gonzalez, Sept. 1977.)
The schedules generated in this manner are optimal under the stated constraints.
The labeling and scheduling procedures are quite simple to implement and are
illustrated in Figure 8.21.
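A compact sketch of this labeling-and-scheduling procedure for unit-time tasks on a rooted tree is given below; the tree representation (each task maps to its unique successor) and the tie-breaking rule are illustrative assumptions.

from collections import defaultdict

def hu_schedule(successor, p):
    # Hu's algorithm for unit-time tasks on a rooted tree (sketch).
    # successor: dict mapping each task to its unique successor
    #            (the terminal task maps to None).
    # p: number of identical processors.
    # Returns a list of unit-time slots, each holding at most p tasks.
    tasks = list(successor)

    # Hu's labeling: the terminal task has label 1; every other task has
    # the label of its successor plus one (its level in the tree).
    label = {}
    def lab(t):
        if t not in label:
            label[t] = 1 if successor[t] is None else lab(successor[t]) + 1
        return label[t]
    for t in tasks:
        lab(t)

    # A task becomes a "starting node" once all of its predecessors are done.
    preds_left = defaultdict(int)
    for t, s in successor.items():
        if s is not None:
            preds_left[s] += 1

    done, schedule = set(), []
    while len(done) < len(tasks):
        ready = [t for t in tasks if t not in done and preds_left[t] == 0]
        # Choose up to p ready tasks with the highest labels (ties arbitrary).
        slot = sorted(ready, key=lambda t: -label[t])[:p]
        schedule.append(slot)
        for t in slot:
            done.add(t)
            if successor[t] is not None:
                preds_left[successor[t]] -= 1
    return schedule

Each slot of the returned list represents one time unit, so the number of slots is the length of the schedule; for example, hu_schedule({'T1': 'T4', 'T2': 'T4', 'T3': 'T5', 'T4': 'T5', 'T5': None}, 2) yields a three-slot schedule.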
Recall that the minimum time required to execute a task graph by Hu's
procedure is t_min. Suppose we wish to process a graph within a prescribed time t,
where t = t_min + C and C is a nonnegative integer. The minimum number of
processors p required then satisfies
(1/(C + γ)) Σ_{i=1}^{γ} g(α_max + 1 − i) ≤ p,    1 ≤ γ ≤ α_max        (8.4)
where g(i) denotes the number of nodes in the graph with label i, and γ* is the
value of the constant γ which maximizes the given expression. To illustrate this
result, consider Figure 8.21. For C = 0, for example, the value γ* occurs when γ = 1
or γ = 2. This indicates that, in order to process the graph in minimum time, four
processors are needed. For C = 1, t = 8 and γ* occurs when γ = 2 or γ = 5, and
three processors are required. Varying C further, we find that three processors are
required when the processes must be processed within nine units, but only two pro-
cessors are needed for a maximum processing time of 10 units.
Another study by Graham shows that, for a computing system with n identical
processors in which processes are assigned arbitrarily to the processors, the
completion time of the set of processes will not be more than twice the time required
by an optimal schedule. This bound was derived in connection with the so-called
multiprocessor anomalies. These anomalies are derived from the counterintuitive
observation that the existence of one of the following conditions can lead to an
increase in execution time:
1. Replace a given process list L by another list L', leaving the set of process times
   μ, the precedence order <, and the number of processors n unchanged.
2. Relax some of the restrictions of the partial ordering.
3. Decrease some of the execution times.
4. Increase the number of processors.
If ω is the completion time under the original conditions and ω' the completion time after the change, then
ω'/ω ≤ 1 + (n − 1)/n'        (8.5)
where n' is the number of processors used after the change (n' = n if the number of processors is unchanged).
This bound is the best possible, and, for n = n', the ratio 2 − 1/n can be achieved
by the variation of any one of L, μ, or <.
The above result was extended to a nonhomogeneous processor system by
Liu and Liu. Suppose a multiprocessor system consists of n_i processors of speed
b_i, for i = 1, 2, ..., k, such that b_1 > b_2 > ··· > b_k ≥ 1. Then
ω'/ω ≤ b_1/b_k + 1 − b_1 / Σ_{i=1}^{k} n_i b_i        (8.6)
Example 8.8 Consider a system with one processor of speed five and five
processors of speed one. By Eq. 8.6, we have
ω'/ω ≤ 5/1 + 1 − 5/10 = 11/2
Comparing this bound with that in Eq. 8.5 for a multiprocessor system with
10 identical processors of speed 1 (by substituting five identical processors of
speed 1 for the processor of speed 5), the ratio 2 − 1/10 is achieved. The deter-
mination of a close-to-optimal schedule is more important for a heterogeneous
system than for a homogeneous system.
In both cases, L_i and E_i refer to the ith latest and earliest precedence partitions,
respectively, and |x| represents the cardinality of the set x. The processes contained
in L_i ∩ E_i are called essential processes. Those processes contained in the ith
subset given by L_i ∩ E_i must be initiated i − 1 units after the start of the initial
process in the graph to guarantee minimum execution time.
Note that + and max are commutative and associative operations. Moreover,
+ distributes over max. For example, max(a, b) + c = max(a + c, b + c). Thus,
if G is simple, the expression for t_G above can be factored in terms of max and +
so that each random variable appears only once. Then, if the t_i's are independent,
the expression for F_G may be found by substituting F_i for t_i, * (convolution) for +,
and · (multiplication) for max in the expression for t_G. The convolution of cdfs
F_1 and F_2 is written as
(F_1 * F_2)(t) = ∫_0^t F_1(t − x) dF_2(x)
For the example in Figure 8.23, there are three chains: C_1 = T_1 T_3 T_5, C_2 =
T_2 T_3 T_5, and C_3 = T_4 T_5. Therefore
t_G = max{max(t_1, t_2) + t_3, t_4} + t_5        (8.11)
Figure 8.23 Computation of t_G and F_G for task graph G. (Courtesy IEEE Trans. Software Engg., Robinson,
Jan. 1979.)
In Eq. 8.11, each random variable appears only once. Hence, F_G may be found by
the substitution rule:
F_G = (((F_1 · F_2) * F_3) · F_4) * F_5
[Figure 8.24: process graph of a four-process merge-sort, with levels marked.]
Merging is a method of combining two or more sorted files into a composite sorted
file. In Figure 8.24, each M_i is a merge of the sorted subfiles produced by its im-
mediate predecessors.
If all permutations of keys are equally likely, then the execution times of S_1,
S_2, S_3, and S_4 have the same cdf and the execution times of M_1 and M_2 have the
same cdf. Let the cdf of the execution times of the S_i's be F_1 and that of M_1 and
M_2 be F_2. Furthermore, let the cdf of the execution time of M_3 be F_3. Then
width(G) = 4, G is simple, and the process-execution times are independent. Let
the execution times of the S_i's and M_i's be t_Si and t_Mi, respectively. Hence, the
execution time of the four-process merge-sort is
t_G = max{max(t_S1, t_S2) + t_M1, max(t_S3, t_S4) + t_M2} + t_M3
Since t_S1, t_S2, t_S3, t_S4 have the same cdf F_1 and t_M1, t_M2 have the same cdf F_2, the
cdf of t_G is F_G:
F_G = (F_1² * F_2) · (F_1² * F_2) * F_3 = (F_1² * F_2)² * F_3        (8.13)
This should be compared with the cdf of the execution time for a one-process
(sequential) merge-sort:
F_seq = F_1 * F_1 * F_1 * F_1 * F_2 * F_2 * F_3        (8.14)
since the execution time for the sequential merge-sort is
t_seq = t_S1 + t_S2 + t_S3 + t_S4 + t_M1 + t_M2 + t_M3        (8.15)
Notice that Eq. 8.15 assumes that the processing environment of the sequential
merge-sort is the same as that of the concurrent merge-sort. In practice, this is not true,
since the sequential merge-sort does not encounter the interprocess-communication
problems or memory conflicts which create overheads in the concurrent merge-
sort. Hence, in practice, t_seq is usually less than that predicted by Eq. 8.15. In the
next section, we consider the effect of these overheads on the performance of the
algorithm.
Let μ_G and μ_seq be the mean execution times of the probability density functions
f_G = F′_G and f_seq = F′_seq, respectively. We can then estimate the theoretical
speedup of the four-process merge-sort as
S_4 = μ_seq/μ_G        (8.16)
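The cdf manipulations above can be checked numerically; the sketch below estimates μ_G, μ_seq, and the speedup S_4 by simulation, with exponentially distributed sort and merge times as a purely illustrative assumption.

import random

def merge_sort_speedup(mean_s=1.0, mean_m=0.6, mean_m3=1.2, trials=100_000):
    # Monte Carlo estimate of the speedup S_4 = mu_seq / mu_G of Eq. 8.16.
    # Sort times t_S1..t_S4 share one distribution, the first-level merges
    # t_M1, t_M2 share another, and the final merge t_M3 has its own;
    # exponential distributions are an illustrative assumption only.
    sum_g = sum_seq = 0.0
    for _ in range(trials):
        ts = [random.expovariate(1.0 / mean_s) for _ in range(4)]
        tm = [random.expovariate(1.0 / mean_m) for _ in range(2)]
        tm3 = random.expovariate(1.0 / mean_m3)
        # Parallel time: max{max(tS1,tS2)+tM1, max(tS3,tS4)+tM2} + tM3
        t_g = max(max(ts[0], ts[1]) + tm[0], max(ts[2], ts[3]) + tm[1]) + tm3
        # Sequential time: the sum of all seven process times (Eq. 8.15)
        sum_g += t_g
        sum_seq += sum(ts) + sum(tm) + tm3
    return (sum_seq / trials) / (sum_g / trials)

print(merge_sort_speedup())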
Equation 8.9 is not very useful when the cdfs of the process execution times
are not known. Bounds can be derived for the mean execution time by using more
limited knowledge about the execution times of processes. Let us denote the
expected value of a random variable x by E(x). The level of a process T in a process
graph G is the maximum length of any chain in G from an initial process to T.
The depth of G, denoted by depth(G), is the maximum level of any process. Given
a process graph G with the number of available processors K ≥ width(G) and
with the t_i independent, let C_1, C_2, ..., C_m be all chains in G from initial to final
processes. Also let H_i be the set of all processes of level i, for 1 ≤ i ≤ L, where
L = depth(G). For any set of n random variables {x_i}, E(max_i x_i) ≥ max_i E(x_i),
from which the lower bound follows. For the upper bound, let t_0 = 0 and define
f(i, j) = 0 if C_i ∩ H_j is empty; otherwise f(i, j) is the index of the single process in
C_i ∩ H_j. Then from Eq. 8.9,
Therefore
The upper bound in Eq. 8.19 is useful only if something can be said about
E(max{t_i}). An applicable result from order statistics is that, if the random variables
x_1, x_2, ..., x_m are independent and identically distributed (i.i.d.) with mean
μ and standard deviation σ, then
E( max_{1≤i≤m} x_i ) ≤ μ + σ(m − 1)/√(2m − 1)        (8.20)
Hence, if the number of available processors K ≥ width(G), the t_i's are independent,
depth(G) = L, and the m_l processes on level l have identically distributed execution
times with mean μ_l and standard deviation σ_l, then
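E(t_G) ≤ Σ_{l=1}^{L} [ μ_l + σ_l (m_l − 1)/√(2m_l − 1) ]
a bound obtained by summing Eq. 8.20 over the L levels, since each chain contains at most one process from any level.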
mean interarrival time of processes to the system is 1/λ. Assuming that the service
and interarrival times are exponentially distributed and the service discipline for
processes is the common first-come-first-serve (FCFS) discipline, various performance
factors can be obtained. Figure 8.25 illustrates the resulting queueing model, in
which processes arrive at a rate λ and are serviced at a rate μ. The utilization of the
processors is
ρ = u/p        (8.22)
where u is the traffic intensity, defined by u = λ/μ. The mean response time
of processes is [Kleinrock (1976)]
R = C(p, u)/[pμ(1 − ρ)] + 1/μ        (8.23)
where C(p, u) is Erlang's C formula, given by
C(p, u) = [u^p/(p!(1 − ρ))] / [Σ_{n=0}^{p−1} u^n/n! + u^p/(p!(1 − ρ))]
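A direct transcription of Eqs. 8.22, 8.23, and Erlang's C formula into Python is sketched below; the function names and the example parameter values are illustrative.

from math import factorial

def erlang_c(p, u):
    # Erlang's C formula: probability that an arriving process must queue,
    # for p processors and traffic intensity u = lambda / mu (requires u < p).
    rho = u / p                                    # processor utilization, Eq. 8.22
    last = (u ** p) / (factorial(p) * (1.0 - rho))
    denom = sum((u ** n) / factorial(n) for n in range(p)) + last
    return last / denom

def mean_response_time(p, lam, mu):
    # Mean response time of an M/M/p FCFS queue, Eq. 8.23.
    u = lam / mu
    rho = u / p
    return erlang_c(p, u) / (p * mu * (1.0 - rho)) + 1.0 / mu

# Example: 4 processors, arrival rate 3 processes per unit time, service rate 1.
print(mean_response_time(4, 3.0, 1.0))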
There are a number of other scheduling algorithms, such as round robin (RR),
preemptive, and nonpreemptive priority service disciplines, which can be modeled
by the use of queueing systems. In an RR service discipline, each time a process is
selected for execution, it is selected from the head of the ordered queue and allocated
a fixed duration of run time called the time slice or quantum. If a process terminates
execution before the end of the quantum, it departs from the processor. If at the
Figure 8.25 Queueing model of first-come-first-serve scheduling discipline in a multiprocessor system.
end of the quantum the process has not completed its execution (requires additional
quantum), it is recycled to the end of the queue to await its next selection. New
process arrivals simply join the end of the queue. The RR service discipline can be
combined with the preemptive priority discipline to create a multilevel round-robin
scheduling discipline. This discipline is used to give higher priority processes
more frequent control than lower priority ones. Policies based on priority can be
static (if the priority of a process remains fixed) or dynamic (if the priority of a
process is allowed to change).
In the RR service discipline, a process in the run state is interrupted at the
end of its quantum and may enter the ready state. An external event may cause the
blocking of a running process. These transitions may necessitate a context switch.
Furthermore, a running process can cause an explicit process switch by invoking
a privileged instruction. For example, in the case of a fault, the process can cause a
trap which switches context to the operating system, as in the IBM 370 supervisor
call (SVC) instruction to be described in Chapter 9.
Long-term scheduling operations are used to control the load on the multi-
processor system by making decisions on activating new processes. One method
to implement the schedule is to use priority queues for incoming processes.
Prioritization of processes in a system may result in indefinite postponement of
low priority processes if the arrival rate of the high priority processes is high. A
set of processes which cooperate to solve a problem may be given higher priority
than a single independent process.
Since there are many processors as well as memory modules to be scheduled,
it may be useful to perform group scheduling, in which a set of related processes are
assigned to processors to run simultaneously. Group scheduling can be extended
to make placement decisions for groups of objects at a time, or to swap groups of
related objects in and out. These different group schedulers have several possible
advantages. First, if closely related processes run in parallel, blocking due to
synchronization and frequency of context switching may be reduced. These will
in effect aid in increasing performance. Second, if placement decisions are made
for a group of objects with known reference patterns, the “distance” between the
various processes and their referenced objects might be minimized. Hence,
effective memory management for a set of related processes is easier since
the time period for sharing is restricted to the short presence of the processes
in the system. In general, a group assignment will not be very successful in lessening
the number of context switches unless the processes within the group are "in step" so
that few of them will be blocked from lack of input or other synchronization
requirements.
In this section, we describe and classify the various types of parallel algorithms.
The characterization of parallel algorithms will help in the design and analysis of
these algorithms. Some example algorithms are given. Techniques are shown to
determine the performance of MIMD parallel algorithms.
Although extensive research has been performed on SIMD algorithms, there are
few results available concerning the specification, design and analysis of MIMD
multiprocessor algorithms. That is the basis for this section. A parallel algorithm
for a multiprocessor is a set of k concurrent processes which may operate simul-
taneously and cooperatively to solve a given problem. If k = 1, it is called a
sequential algorithm. To ensure that a parallel algorithm works correctly and
effectively to solve a given problem, processes interact to synchronize and exchange
data. Hence, in a task system, there may be some points where the processes
communicate with other processes. These points are called interaction points. The
interaction points divide a process into stages. Therefore, at the end of each stage,
a process may communicate with some other processes before the next stage of
the computation is initiated.
Because of the interactions between the processes, some processes may be
blocked at certain times. The parallel algorithms in which some processes have to
wait on other processes are called synchronized algorithms. Since the execution
time of a process is variable, depending on the input data and system interruptions,
all the processes that have to synchronize at a given point wait for the slowest
among them. This worst case computation speed is a basic weakness of syn-
chronized algorithms and may result in worse than expected speedup and processor
utilization.
To remedy the problems encountered by synchronized parallel algorithms,
asynchronous parallel algorithms exist for some set of problems. In an asynchronous
algorithm, processes are not generally required to wait for each other, and com-
munication is achieved by reading dynamically updated global variables stored
in shared memory. However, because of the concurrent memory accesses per-
formed, conflicts may occur which will introduce some small delay in processes
accessing common variables. For convenience, we shall often refer to synchronized
and asynchronous parallel algorithms simply as synchronized and asynchronous
algorithms, respectively.
Another alternative approach to constructing parallel algorithms is macro-
pipelining, which is applicable if the computation can be divided into parts, called
stages, so that the output of one or several collected parts is the input for another
part. The program flow is illustrated in Figure 8.26. Because each computation
part is realized as a separate process, communication costs may be high unless
communication is achieved by address transmission. The question may arise as
to whether to move the output data to the site of the next process in the pipeline or
to move the next process, in particular its code, to the site of the data.
As an example, consider a simple pipeline compiler. Different processes are
responsible for lexical analysis, syntax analysis, semantic analysis, optimization,
and code generation. Source input is lexically analyzed and the recognized lexemes
are input to the syntax-analysis process, thus building input for the semantic
analyzer that, in turn, produces a tree for the code generator. Generated code is
adapted by the final optimization process before being archived as the final
[Figure 8.26: program flow of a macropipeline, with buffers between stages.]
compiler output. Note that the processes that result from pipelining are hetero-
geneous, while those resulting from partitioning are homogeneous.
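A toy macropipeline of this kind can be sketched with one thread per stage and bounded buffers between stages, as below; the stage bodies are placeholders standing in for the compiler phases, not a real compiler.

import queue
import threading

def stage(work, inq, outq):
    # One pipeline stage: read items from inq, transform them, pass them on.
    while True:
        item = inq.get()
        if item is None:          # end-of-stream marker
            outq.put(None)
            return
        outq.put(work(item))

# Placeholder stage bodies standing in for lexical analysis, syntax analysis,
# semantic analysis, optimization, and code generation.
stages = [lambda x, tag=t: f"{x}|{tag}"
          for t in ("lex", "syntax", "semantic", "optimize", "codegen")]

buffers = [queue.Queue(maxsize=4) for _ in range(len(stages) + 1)]
threads = [threading.Thread(target=stage, args=(w, buffers[i], buffers[i + 1]))
           for i, w in enumerate(stages)]
for t in threads:
    t.start()

for line in ("stmt1", "stmt2", "stmt3"):    # the input data set
    buffers[0].put(line)
buffers[0].put(None)                         # close the pipeline

for t in threads:
    t.join()
for item in iter(buffers[-1].get, None):     # collect the pipeline output
    print(item)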
The time taken to execute a fixed stage of a process is a random variable
satisfying some cumulative distribution function. The fluctuations may be due to
the variability of the processor's speeds and the input to the stage. A process may
be blocked at the end of a stage because it is waiting for inputs in a synchronized
algorithm or for the entering of a critical section in an asynchronous algorithm.
The blocking time of a process is the total time that the process is blocked. If the
multiprocessor system is heterogeneous, the execution time of a process will be
smallest if the process is assigned to run on a faster processor. As an illustration
of the variability due to input, we recall that the number of comparisons needed
to sort n elements by the Quicksort algorithm ranges from O(n log₂ n) to O(n²),
depending on the ordering of the input elements. The fluctuations in execution
time may also result from delays due to memory conflicts, system interrupts, page
faults, cache misses and the system work load. A typical source of nonnegligible
overhead is that due to the execution of synchronization primitives. Synchroniza-
tion primitives are needed for synchronizing processes and implementing critical
sections.
An algorithm which requires execution on a multiprocessor system must be
decomposed into a set of processes to exploit the parallelism. Two methods of
decomposition naturally arise: static decomposition and dynamic decomposition.
In static decomposition, the set of processes and their precedence relations are
known before execution. In dynamic decomposition, the set of processes changes
during execution. Static decomposition algorithms offer the possibility of very
low process communication, provided the number of processes is small; however,
Example 8.9
var V, W, X, Y, Z: shared real; var S_1, S_2: semaphore; initial S_1 = S_2 = 0;
cobegin
  Process P_1: begin
    V ← A * B;          // stage 1 of P_1 //
    P(S_2);
    Z ← V + Y;          // stage 2 of P_1 //
  end
  Process P_2: begin
    W ← C * D;          // stage 1 of P_2 //
    V(S_1);
  end
  Process P_3: begin
    X ← 1 + G;          // stage 1 of P_3 //
    P(S_1);
    Y ← W + X;          // stage 2 of P_3 //
    V(S_2);
  end
coend
Clearly, the activation of the second stage of process P_3 is subject to the condition
that process P_2 is completed. Similarly, the second stage of P_1 cannot be initiated
unless the second stage of P_3 is completed. Hence, the set of processes P_1, P_2, and
P_3 is a synchronized parallel algorithm.
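A Python equivalent of Example 8.9, using threads and counting semaphores, is sketched below; the numeric values assigned to A, B, C, D, and G are arbitrary illustrative constants.

import threading

# Shared variables of Example 8.9; the input constants are arbitrary.
A, B, C, D, G = 2.0, 3.0, 4.0, 5.0, 7.0
shared = {}
S1 = threading.Semaphore(0)      # initially 0, as in the example
S2 = threading.Semaphore(0)

def P1():
    shared["V"] = A * B                           # stage 1 of P1
    S2.acquire()                                  # P(S2): wait for stage 2 of P3
    shared["Z"] = shared["V"] + shared["Y"]       # stage 2 of P1

def P2():
    shared["W"] = C * D                           # stage 1 of P2
    S1.release()                                  # V(S1)

def P3():
    shared["X"] = 1 + G                           # stage 1 of P3
    S1.acquire()                                  # P(S1): wait for P2 to complete
    shared["Y"] = shared["W"] + shared["X"]       # stage 2 of P3
    S2.release()                                  # V(S2)

threads = [threading.Thread(target=f) for f in (P1, P2, P3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared["Z"])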
Since the time taken by a stage of a process is a random variable, synchronized
algorithms have the drawback that some processes may be blocked at a given
time, thereby degrading the performance of the algorithm. To illustrate the effect
of the drawback, consider a synchronized algorithm with n processes. Assume
[Figure: interaction of the stages of processes P_1, P_2, and P_3 in Example 8.9.]
t. In general, T is larger than t. The ratio T/t = λ_n is the penalty factor for syn-
chronizing the identical stages. If the penalty factor is large, the performance of
the synchronized algorithm is greatly degraded. The speedup bound S_n and the
penalty factor λ_n give some indication of the performance of synchronized
algorithms.
where the x_k and b are vectors, each of size n, and A is an n × n matrix.
A common application of iterative methods is in solving elliptic differential
equations with boundary conditions (boundary-value problems). A simple but
important equation of physics is Poisson’s equation:
∂²u/∂x² + ∂²u/∂y² = f(x, y)        (8.27)
When f(x, y) = 0 for all x, y, this equation is known as the Laplace equation. The
boundary-value problem consists of finding the function u(x, y) that satisfies
Eq. 8.27 within a closed region D and conditions prescribed on the boundary of
D. Let D be a square domain in R². Also let the value of u be fixed on the boundary
of D (denoted by ∂D). That is, u(x, y) = f(x, y) for all x, y ∈ ∂D. This is the Dirichlet
problem. To solve this boundary-value problem on a digital computer, the domain
D is sampled by superimposing a rectangular m by n grid or mesh. The distance
between any two mesh points on any horizontal or vertical line is the mesh width,
denoted by h. If h is small enough, we can approximate Eq. 8.27 by
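the standard five-point difference formula, under which each interior mesh point satisfies
u(x, y) = ¼[u(x + h, y) + u(x − h, y) + u(x, y + h) + u(x, y − h)] − (h²/4) f(x, y)
so that the new value of u at a mesh point depends only on its four nearest neighbors, which is what allows the components of an iterate to be updated in parallel.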
parallel, and in the matrix iteration of Eq. 8.26, all the components of the vector
x_{k+1} can be computed simultaneously. Another strategy to implement Eq. 8.31
on a multiprocessor is to exploit the fluctuations in the speed of a process. In this
case, the idea is to use more than one process to compute the same function in
parallel and expect that the process which obtains the result first takes less than
the average time.
In the following discussion, we give an example of a synchronized algorithm
that locates the zero of a monotonically increasing continuous function f(x).
It is assumed that f(x) has opposite signs at the endpoints l and u, so that the
interval of uncertainty is |u − l|, as shown in Figure 8.28. The process terminates when |u − l| < ε,
a permissible error. It should be noted that the algorithm presented can be easily
modified to deal with discrete f, and thus can be used to search for a desired item
in an ordered list.
The zero-searching algorithm is iterative and consists of n slave processes and
a master process. The master process divides the given interval u − l into n + 1
subintervals, each of size Δ = |u − l|/(n + 1). Each slave process i evaluates the
function at x_i = l + iΔ as a stage of the process. When all of these evaluations
complete, the master process compares the computed function values for the sign
Figure 8.28 Finding the zero of a function f(x) with 10 processors, where processor p_i evaluates f(x) at
x_i = l + iΔ, Δ = |u − l|/(n + 1).
Example 8.10
real procedure rootf(f, l, u, n)
begin
  function f;
  var Δ, l, u, y[1:n]: shared real;
  var i: shared integer;
  while |u − l| > ε do
    begin
      Δ ← |u − l|/(n + 1);            // compute subinterval //
      parfor i ← 1 until n do          // create n slave processes //
        begin
          y[i] ← f(l + iΔ);            // evaluate function f(x) //
        end
      l ← l + Δ; i ← 1;                // obtain new interval of uncertainty //
      while sign(y[i]) = sign(y[i + 1]) do
        begin
          l ← l + Δ; i ← i + 1;
        end
      u ← l + Δ;
    end
  z ← (l + u)/2;                       // zero of function f(x) is z //
end procedure rootf
The key feature of this algorithm is the synchronization that occurs between
the slave processes. Each slave process which completes its evaluation of the func-
tion is blocked. The two sequences of statements "l ← l + Δ; i ← i + 1;" are not
executed until all n slave processes have completed their evaluations of the function.
Every slave process is awakened from the blocked state when the next iteration
begins, and all the slave processes become eligible to resume execution simul-
taneously. The nature of the parallel solution demands this synchronization
policy.
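A Python analogue of rootf, using a thread pool for the slave evaluations and the implicit barrier of map for the synchronization, is sketched below; it also evaluates the endpoints, so it is not a literal transcription of the procedure above.

from concurrent.futures import ThreadPoolExecutor

def parallel_root(f, l, u, n, eps=1e-6):
    # Synchronized parallel zero search (sketch). Each iteration evaluates f
    # at n interior points in parallel, waits for all evaluations (the
    # synchronization point), then keeps the subinterval with the sign change.
    with ThreadPoolExecutor(max_workers=n) as pool:
        while abs(u - l) > eps:
            delta = (u - l) / (n + 1)
            xs = [l + i * delta for i in range(n + 2)]   # includes both endpoints
            ys = list(pool.map(f, xs))                    # implicit barrier: waits for all
            for i in range(n + 1):
                if ys[i] == 0.0 or (ys[i] < 0) != (ys[i + 1] < 0):
                    l, u = xs[i], xs[i + 1]
                    break
    return (l + u) / 2.0

# Example: zero of a monotonically increasing function on [0, 4].
print(parallel_root(lambda x: x**3 - 2.0, 0.0, 4.0, n=9))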
Let the time needed to evaluate f at a point in the interval be a random variable
t with mean t̄, and the time needed to determine the new uncertainty interval and to
check the stopping criteria be another random variable c with mean c̄. For this
example, we assume that t̄ ≫ c̄, so that c can be ignored in the analysis. It is also
assumed that the execution time of the synchronization primitive can be ignored.
In evaluating the relative performance of the synchronized parallel zero-
searching algorithm, we note that, on a uniprocessor, binary search is
the best known search method and takes at most ⌈log₂|u − l|⌉ function evalua-
tions. Hence the expected running time is ⌈log₂|u − l|⌉t̄.
For the synchronized parallel algorithm, it is clear that every iteration reduces
the length of the interval of uncertainty by a factor of n + 1 when n slave processors
are used. Therefore, the algorithm uses ⌈log_{n+1}|u − l|⌉ iterations and is optimal.
However, the expected time for each iteration is λ_n·t̄ rather than t̄, where λ_n is the
penalty factor of synchronizing n function evaluations. Therefore, the expected
running time of the algorithm is ⌈log_{n+1}|u − l|⌉λ_n·t̄. Since the speedup is of the
order of log n, it performs poorly for large n. Hence, when n is large, a different
search scheme that is efficient must be devised. The synchronized parallel algor-
ithm can also be inefficient when λ_n is large, which usually happens when n is
large.
contain the current values of f(x), f′(x), and x, respectively. For example, after
the (i + 1)th iteration of Eq. 8.25, f(x_{i−1}), f′(x_{i−1}), and x_i are updated to f(x_i),
f′(x_i), and x_{i+1}, respectively. Suppose that the evaluation of f′(x) is computationally
more expensive than that of f(x); then a reasonable asynchronous iterative
algorithm consisting of two processes P_1 and P_2 can be defined as follows. Let
process P_1 update variables v_1 and v_3, while process P_2 updates v_2. The program
below shows a sketch of processes P_1 and P_2.
Example 8.12
function f, f′;
var v_1, v_2, v_3: shared real;
cobegin
  Process P_1: begin
    while <termination criteria S not satisfied> do
      begin
        v_1 ← f(v_3);              // step 1 of P_1 //
        v_3 ← v_3 − v_1/v_2;       // step 2 of P_1 //
      end
  end P_1
  Process P_2: begin
    while <termination criteria S not satisfied> do
      v_2 ← f′(v_3);               // step 1 of P_2 //
  end P_2
coend
From the program it can be seen that, as soon as a process completes up-
dating a global variable, it proceeds to the next update by using the current
values of the relevant variables, without any delay. Suppose that the iterates are
labeled in the order they are computed by step 2 of process P_1. Then, in general,
the iterates generated do not satisfy the recurrence relation of Eq. 8.25. For
example, if the initial values of the variables are v_1 = f(x_0), v_2 = f′(x_0), and v_3 = x_0,
then the sequence and time period of step completions for each iteration within
each process may be illustrated by a timing diagram, as shown in Figure 8.29. The
number i on a demarcation of the timing diagram indicates the point where the
ith iteration starts for that process. Then, for this illustration,
x_1 = x_0 − f(x_0)/f′(x_0)
x_3 = x_2 − f(x_2)/f′(x_1)
x_4 = x_3 − f(x_3)/f′(x_2)
From the concurrent program given above for P_1 and P_2, the recurrence relation
that is generally followed by the execution of the processes is
x_{i+1} = x_i − f(x_i)/f′(x_j),    j ≤ i
where x_j is the iterate whose derivative value is currently held in v_2.
[Figure 8.29: timing diagram of step completions for processes P_1 and P_2.]
the properties of the sequence {x_i} because of the fluctuations of the speed of a
process. Moreover, since the iterates generated by an asynchronous iterative
algorithm in general do not satisfy any deterministic recurrence relation such as
Eq. 8.25, it is difficult to obtain a general theory concerning conditions for con-
vergence or the speed of convergence.
The design of an asynchronous iterative algorithm for a general iterative
process (Eq. 8.31) involves the identification of some set of global variables
{v[1], v[2], ..., v[r]} such that each iterative step can be regarded as computing
the new values of the v[i]'s from their old values. In general, it is desirable to choose
the v[i]'s so that the updating of each v[i] constitutes a significant portion of the work
involved in one iteration. For example, consider the matrix iteration of Eq. 8.26.
In this case, the v[i]'s may be chosen as segments of equal size of the components in a
vector iterate. After the v[i]'s have been chosen, concurrent processes which
update the v[i]'s asynchronously can be defined as follows. Suppose there are n
elements each in the vectors x_k and b. The set of global variables {v[1], v[2], ..., v[n]}
can be partitioned into p subsets, each of size n/p = s (assuming that p divides n).
The kth process updates the subset {v[(k − 1)s + 1], ..., v[ks]}, where v[l]
represents the current value of the lth component of the vector x_k. Below is a p-process
algorithm to solve the linear system of equations represented by Eq. 8.26.
Example 8.13
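The listing of Example 8.13 is a pseudocode program; a rough Python analogue of the p-process decomposition just described is sketched below, assuming an iteration of the form x ← Ax + b for Eq. 8.26 and a fixed sweep count in place of a convergence test. The names used are illustrative.

import threading

def async_iterate(A, b, x, p, sweeps=100):
    # Asynchronous iteration sketch: p worker threads each repeatedly update
    # their own block of the shared vector x, always reading the current
    # values written by the other workers (no synchronization between sweeps).
    n = len(x)
    s = n // p                                    # block size, assuming p divides n
    def worker(k):
        lo, hi = k * s, (k + 1) * s
        for _ in range(sweeps):                   # stands in for a convergence test
            for i in range(lo, hi):
                x[i] = sum(A[i][j] * x[j] for j in range(n)) + b[i]
    threads = [threading.Thread(target=worker, args=(k,)) for k in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return x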
Figure 8.30 Speedup of parallel quicksort algorithm on Cm*.
system features. Medusa and Staros are two different operating systems designed
for the Cm*. Notice the effect of the operating system on the performance of the
algorithm. This is due to the overhead incurred by the invocation of the operating
system functions for process scheduling and other chores. The performance peaks
for a degree of decomposition between 6 and 8. This is due to the heavy contention
of references to all the shared data located in one of the computer modules.
Figure 8.31 Speedup of purely asynchronous PDE algorithm on the Cm*.
and operating system environments. As long as most of the processes run in the
cluster that contains the global data, the speedup is almost linear. Otherwise, the speedup
resembles that of the slowest process. In general, the distribution of the global data
among the clusters will affect the performance. Also, the convergence depends on
the relative speed of the various processes evaluating different parts of the grid.
In general, asynchronous algorithms, if available, are preferred to synchronized
algorithms provided the computations converge to the desired result.
The power of analytical models resides in the estimation of the impact on the
performance of a given feature or subset of features in isolation.
A symmetric multiprocessor system is made up of a set of identical processors.
Generally the computation on each processor consists of a random number of
instruction executions. The randomness in this number results, for example, from
the evaluation of a function whose definition varies over its domain or of a function
computed by a series expansion. In a minicomputer or a microprocessor, an
instruction cycle consists of a variable number of machine cycles. Typical machine
cycles are instruction fetch, operand fetch, and execution cycle. We will thus
distinguish between memory access cycles and execution cycles. When a request
for a memory word is rejected because of conflicts, the processor automatically
retries by initiating a new memory access cycle. Memory access time fluctuations
result. The execution part of an instruction cycle also has a random duration. Its
duration may depend on the operand values. The number of machine cycles in
the execution of an instruction is thus random, resulting in execution time
fluctuations.
Since we are strictly concerned with the modeling of synchronized iterative
processes, we concentrate on the efficient implementation of one iteration of any
iterative algorithm with a given structure and size on an MIMD machine. In this
framework, we consider the input data set for an algorithm described by Eq. 8.31
as including both the iterates x, and the parameters defining ®. In the case of an
LSE, for example, these parameters are the system coefficients. Another cause of
processing time fluctuations is the occurrence of external interrupts and page faults
in the local or shared memories. To effectively isolate the performance of the
algorithm on the architecture, we assume that each processor is uniprogrammed
and that the memories are large enough to accommodate the address space of each
process so that page faults do not occur. Moreover, external interrupts are disabled.
A loosely coupled multiprocessor system is one in which the processors access
their instructions and data in their local memory. Thus, no memory access time
fluctuation exists in this system. To communicate, the processors can initiate a
data block transfer through their direct memory access (DMA) gate and high-speed
bus (HSB) with broadcasting capability. The DMA gate has a fast communication
memory (CM) which can be accessed also by the processor. To send a message
to other processors, a processor stores the message in the CM, then initiates the
transfer. The DMA controller of the sender monitors the bus. When it is free, a
connection is established on the bus with the DMA controllers of the receivers,
in a fixed connection time. The message is then transmitted on the bus and is simultaneously read by
the receivers in a transmission time that depends on the message length. The total time the bus is busy for one message transfer is
thus the sum of the connection time and the transmission time (Eq. 8.34).
without conflicts. Under these conditions, the bus-busy time of Eq. 8.34 is deterministic for a
given transfer.
To implement a synchronized algorithm on a loosely coupled system by
vectorial decomposition, each of the P processors updates its subset of the iterates,
then sends the values to the (P — 1) other processors through the HSB. When a
processor has received the updates from all the other processors, it can proceed
to the next iteration.
Assume that the iterative algorithm is decomposed vectorially into P i.i.d.
tasks. The P processors iterate through cycles in which they compute their subset
of the iterates (processing phase), eventually communicate, and synchronize, as
illustrated in Figure 8.32. The performance index is the efficiency factor E, defined
as the fraction of time a processor is doing useful work. In a loosely coupled system,
useful work is done during the entire processing phase only. This is not so in
tightly coupled systems, where the cycles wasted in memory conflicts have to be
deducted. This definition of the efficiency isolates the effect of the architecture on
the performance. To compare the performance of the parallel algorithm with its
corresponding sequential version, we would multiply the efficiency as defined here
by a factor taking into account software restructuring or added software overhead
in each iteration and by the ratio of the number of iterations required in both
cases.
The techniques used to model the effect of synchronizations are drawn from
order statistics. Let T_{j,P} be the processing time of the jth processor to terminate
(in the chronological order) when P processors are used. The estimation of the
mean of T_{j,P} is equivalent to the estimation of the mean of the jth order statistic
among P independent samples drawn from the processing-time distribution. For
many distributions of interest, and for i.i.d. processes, the mean of T_{j,P} is given by
m_{j,P} = m_0 + θ_{j,P}·σ_0        (8.35)
where m_0 and σ_0 are the mean and standard deviation of the processing time, and θ_{j,P} is the
mean of the jth order statistic among P samples drawn from the processing-time
distribution with mean 0 and variance 1. For example, for a uniform distribution,
θ_{P,P} = √3 (P − 1)/(P + 1)
[Figure 8.32: processing and communication phases of each processor's cycle: compute the iterates and read the iterate updates received during the previous iteration from the communication memory; then send and receive iterate values.]
where E(max_i x_i) is the average of the maximum and E(x_i) is the average of x_i. According
to this result, we write
m_c = max over 1 ≤ j ≤ P of [m_0 + θ_{j,P}·σ_0 + (P − j + 1)·t_co]        (8.38)
Taking the lower bound for m_c, we can derive the efficiency as
E = m_0/m_c = m_0 / max over 1 ≤ j ≤ P of [m_0 + θ_{j,P}·σ_0 + (P − j + 1)·t_co] = 1/(1 + A)        (8.39)
with
A(P, C_0, τ_co) = max over 1 ≤ j ≤ P of [θ_{j,P}·C_0 + (P − j + 1)·τ_co]
The set of performance features is P, C_0 = σ_0/m_0, and τ_co = t_co/m_0.
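The efficiency predicted by Eq. 8.39 can be evaluated directly; the sketch below uses the uniform-distribution order statistic quoted after Eq. 8.35 and is intended only to reproduce the general shape of curves such as those in Figure 8.33.

from math import sqrt

def theta_uniform(j, P):
    # Mean of the jth order statistic of P samples from a uniform
    # distribution normalized to mean 0 and variance 1.
    return sqrt(3.0) * (2.0 * j / (P + 1.0) - 1.0)

def efficiency(P, c0, tau_co):
    # Efficiency factor E of Eq. 8.39 for P processors, coefficient of
    # variation c0 = sigma_0/m_0 and communication ratio tau_co = t_co/m_0.
    A = max(theta_uniform(j, P) * c0 + (P - j + 1) * tau_co
            for j in range(1, P + 1))
    return 1.0 / (1.0 + A)

for tau in (0.0, 0.02, 0.05, 0.10):
    print(tau, round(efficiency(16, 0.2, tau), 3))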
The efficiency factors for the loosely coupled system are displayed in Figure
8.33 as a function of the features. From these figures, it is clear that the high
sensitivity of the performance index to the communication-to-computation
times ratio τ_co limits the applicability of loosely coupled architectures for syn-
chronized iterative algorithms. Such architectures are well matched for algorithms
with a low communication to computation times ratio. However, when this ratio
increases beyond a few percent, the efficiency decreases rapidly. This property
reflects the power of a high-speed bus with broadcasting capability.
The above methodology can be used to study the degree of matching between
an algorithm and a multiprocessor architecture. This methodology is based on
extracting performance features for a class of algorithms and an architecture from
an approximate analytical model. The features define a multidimensional space. A
performance index is then a mapping from the feature space on the real line. Given
the architecture, algorithms pertaining to the class defined by the hypothesis of
the model can be partially ordered according to the value of the performance
index. This ordering allows us to locate regions in the feature space where the
architecture is well matched to algorithms in the class.
The difficulty of this approach is in striking a proper balance between the
simplicity and tractability of the analytical model and its accuracy. As modeling
Figure 8.33 Efficiency versus feature τ_co for a loosely coupled system with a high-speed bus. (Courtesy IEEE
Trans. Software Engg., Dubois and Briggs, July 1982.)
tools improve, the analytical model may be refined. Even if approximate, the
feature space approach is much more realistic than complexity studies.
In Figure 8.34, cuts through the feature space are displayed. These cuts are
two planes for each case. The index function E is represented by contours of equal
index value. Loosely coupled systems are effective for processing-intensive
computations (low values of feature t.,). The regions with a high-efficiency factor
shrink as the number of processes increases. Visualizing the feature space by cuts
such as in Figure 8.34 is of great help in understanding the interaction between
the architecture and the class of algorithms. Other architectures could be studied
using the methodologies discussed.
The estimation of E is very important to the software designer for MIMD
systems, since it is the proportionality factor between m_0 and m_c, the average
iteration time (see Eq. 8.38, for example). As a result of the analysis, a given imple-
Figure 8.34 Feature planes for a loosely coupled system with a high-speed bus (solid lines: P = 16; dashed lines: P = 4). (Courtesy of IEEE Trans. Software Engg., Dubois and Briggs, July 1982.)
mentation may be revealed as inefficient for the architecture and may have to be
restructured.
distribution for the time taken by a computation unit for the same reasons as
mentioned previously. This property is important so that the underlying processing-
time distribution is preserved for all partitions of the computation. To obtain a
decomposition of a homogeneous computation into i.i.d. tasks, we partition the
computation into sets containing the same number of units. Each set defines a
process. The number of such processes in the partition is the degree of decomposi-
tion, denoted by P since one processor is devoted to each process. The maximum
degree of decomposition (PMAX) is obtained when each process contains only
one computation unit.
A simple model for the mean and variance of the number of active cycles in a
process of a homogeneous computation as a function of the decomposition is
m_0 = m_A + m_B/P    and    σ_0² = σ_A² + σ_B²/P        (8.40)
where m_A and σ_A² are the mean and variance of the fixed overhead independent of
the decomposition, while m_B and σ_B² correspond to the mean and variance of the
partitionable part of the computation.
As the degree of decomposition increases, the number of iterates to be com-
puted by each process reduces proportionally (accounting for the second term in
Eq. 8.40). The overhead term might include the initiation of a transfer through the
high-speed bus. In most implementations, making a private copy of the iterate set
will be required at the beginning of each iteration and should be accounted for in
m_A and σ_A².
For the loosely coupled system, the block transfer time is modeled by
t_co = t_A1 + t_B1/P        (8.41)
As in Eq. 8.35, the transfer time is deterministic and assumed identical for all
the processes. The first term of Eq. 8.41 represents the transfer overhead (time
between the reservation of the bus or memory module and the actual beginning of
a transfer) plus the time taken possibly by the transfer of a fixed amount of in-
formation independent of the decomposition. Since the number of iterates com-
puted by each process decreases proportionally to the decomposition, the transfer
time decreases accordingly. Equation 8.41 means that the time to initiate a transfer,
t_A1, and the speed of the transfer are independent of the decomposition, an
assumption not always verified for the bus system but nonetheless simple and
realistic in most cases. The speedup, denoted by S_P, is defined as the ratio of the
times taken by the algorithm on a uniprocessor and on a multiprocessor system
when the degree of decomposition is P (with maximum PMAX). It is computed
as follows for the case when m_A = σ_A² = 0.
For the loosely coupled architecture with a high-speed bus, we have
S_P = P/(1 + A)        (8.42)
with
Note that, contrary to the general study discussed earlier, the parameters are normalized in this section with respect to the total computation m_B.
The speedup curves are displayed in Figure 8.35. C_v (the coefficient of variation of the total computation) is shown to limit the optimum decomposition in most cases. Loosely coupled systems have a very good speedup when t_B is small (0.1 percent) and t_A ≈ 0. However, a nonnull constant term (Figure 8.36) in each
Figure 8.35 Speedup versus degree of decomposition P for a loosely coupled system with a high-speed bus (solid line C_v = 1 percent, dashed line C_v = 2.5 percent). (Courtesy of IEEE Trans. Software Engg., Dubois and Briggs, July 1982.)
Figure 8.36 Speedup versus degree of decomposition P for a loosely coupled system with a high-speed bus, with a nonnull transfer overhead (solid line C_v = 1 percent, t_A = 0.5 percent). (Courtesy of IEEE Trans. Software Engg., Dubois and Briggs, July 1982.)
transfer causes the speedup to peak. In this latter case, the optimum decomposition is limited to a degree of 16 to 40 for the examples considered.
The speedup conceivably peaks out or saturates when the decomposition increases. This intuitive reasoning is confirmed quantitatively by the curves of Figure 8.34. These results can be used as a guideline by the compiler or the user for an effective decomposition of an MIMD iterative problem into tasks of similar statistical properties.
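To make the trade-off captured by Eqs. 8.40 and 8.41 concrete, the short C program below evaluates a decomposition model of this form and prints the resulting speedup for several degrees of decomposition. The parameter values and the exact expression used for the speedup are assumptions made for this illustration only; the program is a sketch of how fixed overheads limit the useful degree of decomposition, not a reproduction of Eq. 8.42.

/* Illustrative sketch only: evaluates a decomposition model of the form of
 * Eqs. 8.40-8.41.  The parameter values and the speedup expression are
 * assumptions, not the authors' exact formulas.                            */
#include <stdio.h>

int main(void)
{
    double m_B = 1.0;      /* partitionable computation (normalized total)  */
    double m_A = 0.005;    /* fixed per-iteration overhead (assumed value)  */
    double t_A = 0.005;    /* fixed transfer overhead (assumed value)       */
    double t_B = 0.010;    /* partitionable transfer time (assumed value)   */

    for (int P = 1; P <= 64; P *= 2) {
        /* Time per iteration on P processors: fixed parts plus 1/P parts.  */
        double t_par = m_A + m_B / P + t_A + t_B / P;
        double s_p   = m_B / t_par;     /* speedup over a single processor  */
        printf("P = %2d   S_P = %6.2f\n", P, s_p);
    }
    return 0;
}

As P grows, the 1/P terms vanish and the speedup saturates near m_B/(m_A + t_A), which illustrates why a nonnull constant overhead limits the optimum decomposition.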
Problems
8.1 Describe the following terminologies associated with multiprocessor operating systems and MIMD algorithms:
(a) Mutual exclusion
(b) The TEST-AND-SET instruction
(c) The ENQUEUE and DEQUEUE operations
(d) The P and V operators
(e) Conditional critical sections
(f) Deadlock prevention and avoidance
(g) Deadlock detection and recovery
For i = 1 to n do
cobegin
S_1[i]; S_2[i]; ... ; S_k[i];
coend
k processors are devoted to the above computation. Each computation S_j[i], for 1 ≤ j ≤ k, takes an unpredictable and random time to execute. Complete the program between the begin and end statements for process j, using a critical section (csect) on the shared variable pcount and P and V operations on the binary semaphore s. The primitives implement the synchronization of these k processors. Hint: The last processor to finish one iteration for one value of i updates pcount and "awakens" the other blocked processors through s.
{Process j}:
For i ← 1 to n do
begin
8.4 Assume that f_1 and f_2 are pointers to two sorted lists, each arranged as a circular doubly linked list with a head node. f_1 and f_2 are sorted in ascending order and each node has three fields, namely, LLINK (left link), DATA, and RLINK (right link) (Figure 8.37). Write an asynchronous MIMD algorithm for two processes P_1 and P_2 to merge the files f_1 and f_2 into a sorted list f, also arranged as a circular list. Process P_1 retrieves from the fronts of f_1 and f_2 to produce a sorted sublist d_1, while P_2 retrieves from the rears of f_1 and f_2 to produce the sorted sublist d_2, until either f_1 or f_2 becomes empty. Attach the nonempty list to d_1 and d_2 to produce f. Return any unused node x to the storage pool with the operation POOL(x). Note that sublists d_1 and d_2 are initially empty.
Figure 8.37 Parallel merging in Problem 8.4: lists f_1 and f_2 with front and rear pointers, and the initially empty sublists d_1 and d_2.
y = A·x + b,
where y is n × 1, A is n × n, and b is n × 1.
PROCESS:
{ Parfor i ← 0 until n − 1 do
    y(i) ← 0;
    For j ← 0 until n − 1 do
      y(i) ← y(i) + A(i, j) * x(j);
    y(i) ← y(i) + b(i);
}
Assume that each assignment of the variable i in the Parfor statement takes c seconds. The time it takes to start or spawn a new process for a given i is a sum of independent random variables x_1, ..., x_i, where x_j is exponentially distributed with mean time 1/(j + 1)λ. What is the speedup S_n of the parallel process if multiplication and addition take t_m and t_a seconds, respectively, on each processor and n concurrent processes are used? Also assume that nc < 1/λ. Then plot S_n versus n for values of 1/λ = 0.1, 0.5, 0.8, 1, 2, 5, and 10.
8.6 Dijkstra's problem of dining philosophers (slightly generalized) is: There are n philosophers whose lives consist of alternately thinking and eating. The philosophers eat at a large circular table with a preassigned plate for each. Between two plates is a fork, which may be used by either adjacent philosopher. In order to eat, a philosopher must have two forks (one on his left and the other on his right). Devise a control program of the general form given below, where (a) and (b) are code sections which claim and release the two forks, respectively. Note that the n philosophers may not necessarily follow the same control program to claim and release the forks. You should specify these two code sections such that the following properties are met:
(a) Use P-V for communication and synchronization.
(b) Allow a fork to be held by only one philosopher at a time.
(c) Use strictly local information, e.g., LEFT.FORK and RIGHT.FORK, to indicate the two forks (resources) at both sides of each philosopher (process).
(d) Guarantee that no philosopher will starve; that is, prevent the situation in which the n processes enter a deadlock.
8.7 Solve the dining philosophers problem using (a) generalized P and V, (b) conditional critical regions.
8.8 Using P-V operations, write the synchronizing program for the task graph shown in Figure 8.38, where statement T_i controls the execution of task T_i. The solution should be of the following form:
Define semaphores;
Initialization of semaphores;
cobegin
  T_1: ...
  T_2: ...
  ...
  T_n: ...
coend
8.9 (a) Which combination of processes causes deadlock in the five-process code segment programmed below? The processes are A, B, C, D, and E.
Begin
shared record
begin
var S,,5,,8,,5,; semaphore;
var blocked, unblocked: integer;
end
initial blocked = 0, unblocked = 1;
initial S, = 5, = S, = unblocked, S, = blocked;
cobegin
A: begin P(S,); V(S,}, P(S,): V(S,} end:
B: begin P(S,), P(S,); V(S,); V(S,); V(S,) end:
C: begin P(S,); P(S,); V(S,); V(S,) end
D: begin P(S,); P(S,}; P(S,): V(S,}: V(S,) end
E: begin P(S,); P(S,); V(S,); V(S,) end
coend
End
(b) Besides the combination of processes mentioned in your answer to part (a), which additional process(es) could be indefinitely blocked because of this deadlock?
(c) For the processes given, is deadlock inevitable, or does it depend on race conditions? Justify your answer.
(d) Assume that the skeletal code segment programmed above is an abbreviated version of a more complex program and that only the details of the semaphore-related code are shown here. Guarantee that all five processes in the real program complete by making a minor change to one of the skeletal processes.
8.10 Write a parallel algorithm to implement the concurrent Quicksort algorithm described on pages
625-626.
8.11 Show that the use of PE and VE could result in a deadlock or possibly in starvation in a system of resources.
CHAPTER
NINE
EXAMPLE MULTIPROCESSOR SYSTEMS
Multiprocessor systems can be divided into two classes: the exploratory research computers and commercial multiprocessors. We consider a system to be exploratory if it is developed mainly for research purposes or for dedicated missions. Commercial multiprocessors are those systems that are available in the computer market. A summary of existing multiprocessors is given below. We leave the details of each system to subsequent sections. Some of the systems were studied in the previous two chapters.
We study the C.mmp architecture and its specially developed Hydra operating system in Section 9.2. The hierarchically structured Cm* has already been described in Chapter 7. The Cm* is still being used at CMU as a research vehicle. The C.mmp is no longer in operation.
Another crossbar-structured multiprocessor is the S-1 system currently under development at the Lawrence Livermore National Laboratory. It is a 16-processor system. However, each uniprocessor in the S-1 is custom designed for a multiprocessing environment. The S-1, once completed, should be a gigaflops machine. We shall study its processor characteristics and software development in Section 9.3.
This section reviews the architectural features of the C.mmp system and the kernel of its Hydra operating system. Reported performance of the C.mmp will also be examined in Section 9.2.3.
(Figure: the C.mmp structure, with processors and shared memory connected through a 16 × 16 crossbar switch (S.mp), and an interprocessor bus carrying the Kinterbus control and the Kclock.)
instructions are HALT, RESET, WAIT, RTI (return from interrupt), and RTT (return from trap).
Further modifications were made to permit address-bounds checking on the stack pointer register R6. These modifications were required for software protection. The operating system is required to deposit some context information on the stack over protected procedure calls. RTI and RTT were modified since they modify the processor status, which must be protected because it is used to control the memory protection scheme. The PDP-11/40E processors were modified further to allow an extended writable control store.
Each processor has an 8K-byte local memory that is used primarily for
operating system functions. The principal secondary memories of the C.mmp
consist of four drives of 40M-byte disk controllers, three drives of 130M-byte disk
controllers, and fixed head disks with zero latency controllers that are used for
paging space. The peripheral devices are assigned to the Unibus of specific pro-
cessors, as shown in Figure 9.2 for one processor. Hence there is no physical sharing
of peripherals. A processor cannot initiate an I/O operation on a peripheral that
(Figure 9.2: a C.mmp processor and its Unibus components, including the cache, local memory, I/O devices, and the Kibi interbus interface, connected to the crossbar switch.)
is not on its Unibus. Fortunately, the operating system hides many of the asym-
metries of the I/O subsystem from the user.
An interprocessor bus which connects the entire set of processors is used to
perform the general function of interprocess communication. The bus provides a
common clock (Kclock) as well as an interprocessor control (Kinterbus). These two
logically and functionally separate features travel separate data paths, although
they share a common control. Each processor has an interbus interface (Kibi) that
defines the processor’s bus address and makes available the bus functions to the
software. The bus provides three basic functions, as described below.
The first function is to continuously broadcast a 60-bit 250-kHz nonrepeating
clock (Kclock). This is done by multiplexing the clock value onto a 16-bit wide
data path in four time periods, with low-order bits first. Any Kibi requesting a
clock read waits for the initial time period and then buffers the four transmissions
in four local holding registers available to the software. Clock values are often
used for unique name generation in the operating system. The otherwise unused
high-order four bits of the fourth local register are set to the processor number
(bus address) to ensure uniqueness when any number of Kibis read the bus simultaneously. A countdown register is also maintained in each Kibi for interval timing. It may be initialized with a nonzero value by the program; a one is subtracted every 16 μs (timing supplied by the Kclock) and the process is interrupted when the register reaches zero.
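As a small illustration of how the buffered clock transmissions yield systemwide unique names, the C sketch below assembles the four 16-bit pieces and places the processor number in the otherwise unused high-order four bits. The function and variable names are hypothetical; this is only a model of the scheme described above, not C.mmp software.

#include <stdint.h>

/* Sketch of unique-name generation from the Kclock scheme described above.
 * piece[0..3] are the four 16-bit transmissions buffered by a Kibi, low-order
 * bits first; proc is the 4-bit processor (bus) address.                     */
uint64_t unique_name(const uint16_t piece[4], unsigned proc)
{
    uint64_t name = 0;

    /* Reassemble the 60-bit clock value, low-order 16 bits first. */
    for (int i = 0; i < 4; i++)
        name |= (uint64_t)piece[i] << (16 * i);

    /* The high-order four bits of the fourth holding register are unused by
     * the 60-bit clock; setting them to the processor number keeps names
     * unique even when several Kibis read the bus in the same time period.  */
    name &= ~((uint64_t)0xF << 60);
    name |= (uint64_t)(proc & 0xFu) << 60;

    return name;
}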
The second and third bus functions are the interprocessor interrupts at three priority levels and the control mechanism. Each processor may interrupt, halt, continue, or start any processor, including itself. These functions are used only when a drastic action such as systemwide reinitialization is necessary. The control operations are invoked by setting the bit(s) corresponding to the processor(s) to be controlled in a 16-bit register provided by the Kibi for the desired operation.
A second 16-bit wide data path is eight-way time multiplexed. Each control operation is assigned a time period. As the appropriate period arrives, each Kibi ORs its control operation register onto the bus and clears the register.
Synchronization of bus accesses and operation specification are accomplished by the multiplexed time periods. The Kibi also inspects the bus to see if the specified operation is being invoked on its processor; if so, the action is performed. Setting the ith bit of the Kibi register associated with one of the functions will invoke that function on the ith processor. Thus, for example, moving a mask of all 1s into the halt register in each Kibi will stop the entire system. Although eight time periods are available, only six are used: three priority levels of interprocessor interrupt, halt, continue, and start; the remaining two are ignored.
Probably the greatest limiting factor in building a large computing system
from minicomputers is their small address space. In most cases, it is required to be
able to address several million bytes of primary memory from the processors. The
basic PDP-11 architecture is only capable of generating 16-bit addresses. Although
the processor may generate only a 16-bit address, the Unibus supports an 18-bit
address, and the shared memory uses a 25-bit address. An address relocation
hardware Dmap associated with each processor performs the memory address
translation. Its relationship with other bus components is shown in Figure 9.2.
The processor-generated addresses are divided into eight pages, where each page
is an 8K byte unit. Unibus addresses are divided into 32 pages, and the shared
memory is divided into 4096 pages.
As shown in Figure 9.3, the two extra bits of the Unibus address are obtained from the program status register (PS) in the processor. These bits may not be altered by any user program. The user programs are actually bound to operate within the eight pages described by a subset of relocation registers. Such a subset is called a space and is named by the two bits <7:8> in the PS. With these two space bits, four address spaces can be specified as (0, 0), (0, 1), (1, 0), and (1, 1). Therefore, four sets of eight registers are provided in each relocation unit, although the stack page is common to all spaces to allow communication across spaces. One of these eight registers in a given address space can be selected by using the high-order three bits <13:15> of the 16-bit processor address word.
The four address spaces are the heart of the memory-protection mechanisms
discussed later. The address-mapping registers and PS registers are both located
in the peripheral page, which is addressable via the (1, 1) space bits. The relocation
registers in the space described by the (1, 1) space bits are the only ones that are
directly addressable and are used exclusively by the kernel of the operating system.
Hence, protecting the PS guarantees that no addressability changes may be made
without the approval of the operating system. Direct addressability is accomplished
by disabling two of the relocation registers in (1, 1) space, one each for Mlocal and the control register bank, for all peripheral devices (including Dmap). With these registers disabled, addresses pass along the Unibus unchanged to be received by the addressed register or memory location.
Access to shared memory is performed in two stages: the relocation of the 18-bit processor-generated address into a 25-bit address space, and the resolution of contention in accessing that memory location. As illustrated in Figure 9.3, the Dmap intercepts the 18-bit Unibus addresses (16-bit word plus the two space bits) and translates them as follows: the three high-order bits of the 16-bit word select a register from the bank specified by the space bits. The contents of the register provide a 12-bit page frame number; the remaining 13 bits of the address word are the displacement within that page. The two are concatenated to form the 25-bit shared memory address. This translation is performed for all memory accesses.
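The relocation just described is simple enough to capture in a few lines of C. The sketch below models a single Dmap lookup; the data layout and names are assumptions that follow the description above, and no control bits or protection checks are modeled.

#include <stdint.h>

/* Sketch of the C.mmp Dmap relocation described above (no control bits,
 * no protection checks).                                                  */
struct dmap {
    uint16_t frame[4][8];   /* [space bits][register] -> 12-bit page frame */
};

/* word  : 16-bit processor-generated address
 * space : the two space bits <7:8> taken from the PS register
 * result: the 25-bit shared-memory address                                */
uint32_t dmap_translate(const struct dmap *d, uint16_t word, unsigned space)
{
    unsigned reg    = (word >> 13) & 0x7;                  /* high three bits     */
    uint32_t frame  = d->frame[space & 0x3][reg] & 0xFFF;  /* 12-bit page frame   */
    uint32_t offset = word & 0x1FFF;                       /* 13-bit displacement */

    return (frame << 13) | offset;                         /* 12 + 13 = 25 bits   */
}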
In addition to the 12 page frame bits, there are four bits in each relocation
register used for control. They are designated as no-page-loaded (nonexistent
memory), write-protected (read-only), written-into (dirty), and cacheable bits to
control whether values from the page may be stored in a possible per-processor cache. The per-processor cache was planned but not implemented. The cacheable bit would have been used by the operating system to avoid cache inconsistency. This can be accomplished by indicating pages that are not both shared and writable. Shared and writable pages are never cached.
The shared memory address and (possibly) 16 bits of data, each parity checked,
and two bits of access function data are sent to the cross-point switch. The address
parity is checked at the switch interface. If the check fails, the request is aborted
and the processor interrupted. Data parity is not checked until the data is read
from memory. All parity is generated and data parity checked by the relocation
unit (Dmap) interface to the bus from the switch.
The switch then routes the request to the memory port specified by the high-
order four bits of the address. A port is requested by setting the processor’s bit in
an initial request register. Contention for the port is resolved by periodically
gating the request register into a queue register, which ts left-shifted as the port
becomes available. The shifting creates a priority ordered queue: As a bit is shifted
out, the corresponding processor is granted access to the port. Processor 15 is
assigned the high-order bit; processor 0 the low-order bit, defining the priority.
When the queue register is zero, all requests have been satisfied. The request register
is again gated into the queue register, cleared, and a new cycle begins. A second
request for the same port by a processor must enter via the request register, hence
equality of service among the processors is maintained.
This two-level request mechanism also obscures the internal queue’s priority
ordering to the point that it is of virtually no importance outside the switch,
preserving the symmetrical design of the cross-point. The switch’s maximum
concurrency (16 independent paths) is achieved if all processors request different
ports. The cost of address translation, switch overhead (no contention), and round-trip cable overhead is about 1 μs. This is high by today's standards and is more than the access time of the memory.
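The two-level request/queue arbitration can be modeled compactly in software. The sketch below (hypothetical names and coding) performs one grant: when the queue register is empty it is reloaded from the request register, and the highest-order pending bit is granted next, mimicking the left-shift behavior described above.

#include <stdint.h>

/* Sketch of the C.mmp memory-port arbitration described above. */
struct port_arbiter {
    uint16_t request;   /* incoming requests, one bit per processor  */
    uint16_t queue;     /* requests gated in for the current cycle   */
};

/* Grant the next processor for this port; returns 0..15, or -1 if idle. */
int arbiter_grant(struct port_arbiter *a)
{
    if (a->queue == 0) {            /* cycle finished: gate in new requests */
        a->queue   = a->request;
        a->request = 0;
        if (a->queue == 0)
            return -1;              /* no processor is waiting              */
    }

    /* Granting the highest-order pending bit first models the left shift:
     * processor 15 has the highest priority within a cycle.                */
    for (int p = 15; p >= 0; p--) {
        if (a->queue & (1u << p)) {
            a->queue &= (uint16_t)~(1u << p);   /* shift the granted bit out */
            return p;
        }
    }
    return -1;                      /* not reached */
}

A second request by the same processor enters through the request register again, which is what preserves the equality of service noted above.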
unique LNS for each invocation of the procedure. This LNS disappears after the procedure terminates.
Moreover, a procedure object may contain templates which characterize the actual parameters expected by the procedure. When the procedure is called, the slots in the LNS which correspond to parameter templates in the procedure object are filled with "normal" capabilities derived from the actual parameters supplied by the caller. This derivation is the heart of the protection-checking mechanism, and the template defines the checking to be performed. If the caller's rights are adequate, a capability is constructed in the new LNS. This LNS references the object passed by the caller and contains rights specified in the template.
System interrupts were provided in the C.mmp architecture to facilitate interprocessor communication. An analogous software mechanism is provided at the user level by the kernel. This mechanism, called control interrupts, is used for interprocess communication. When it occurs and is directed to a process, the receiver process transfers control asynchronously to a special address specified by the user. This address is within the addressing domain in which the process is executing at the time. A 16-bit mask is specified by the process that sends a control interrupt. These bits are compared with a mask in the receiver process, and the receiver process is interrupted only if there is a match between one or more bits. Hence adequate protection is maintained. Control interrupts are generally slow; however, they have some properties that make them desirable for error recovery.
The operating system provides an elegant message system for handling com-
munications between processes and thus encourages the use of cooperating processes. The message system uses objects called ports as gateways for processes
to send and receive messages between processes. Each port has a set of logical
terminals called channels which are used to connect between sender and receiver
processes. There are basically two types of channels: input and output. Messages
are sent from output channels and received in input channels. Two processes can
communicate if they each have a capability for the other’s port and if a com-
munication path is established between the ports.
To establish a path between the two ports, the output channel of one is
connected to the input channel of the other and vice versa. Each port provides a
message slot which is a buffer that is used to queue messages. This slot also provides
a local mechanism to name messages. The message system can operate in two
modes: nonacknowledgement and acknowledgement. In the first mode, the
sender sends a message and continues processing without waiting for a reply.
Operations in the second require a reply to a send message. A process that attempts
to receive a message before its arrival is suspended until the arrival of the message,
whereupon the process is unblocked. The message system is also used in I/O communication to provide transparency of the asymmetries of the I/O structure at the user level.
The message system is not very efficient; thus two other synchronization mechanisms, locks and semaphores, are provided. Two types of locks exist in the
Hydra. The kernel lock makes use of hardware facilities such as interprocessor
interrupts which are not available at the user level and therefore can be used only
in the implementation of the Hydra. The spin lock is available to users as it does
not use privileged instructions to implement it. The kernel locks, which are used
primarily to provide mutual exclusion for operations on various system queues and
tables, pervade the implementation of the kernel.
The kernel lock consists of three components: the lock byte, the sublock byte, and the processor mask word. The lock byte maintains a counter of the number of processes waiting for the lock. A process which wishes to obtain the lock indivisibly increments and tests the lock byte with a single PDP-11 instruction; if the result indicates that the lock is free, it is then locked and the locking process can execute its critical code. Otherwise, the process sets the bit corresponding to its processor in the processor mask word and executes a WAIT instruction with all interrupts except the highest priority IPI (interprocessor interrupt) disabled.
When a process is ready to unlock the lock, it indivisibly decrements and tests the count in the lock byte; if the result indicates that no other processes are waiting, the lock is unlocked and normal execution can continue. If other processes are waiting, the unlocking process sets the sublock byte to one and sends an IPI to every processor with a bit set in the processor mask. These processors resume after their WAIT instructions and indivisibly decrement and test the sublock byte. One random processor will discover the count to be zero, remove itself from the mask of waiting processors, and execute its process's critical code. The other processors go back to waiting.
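The kernel-lock protocol can be summarized in a short C sketch. The version below uses C11 atomics in place of the PDP-11 indivisible instructions and leaves the WAIT instruction and the interprocessor interrupt as external stubs; the names are hypothetical, and the sketch ignores details such as the window between incrementing the lock byte and setting the mask bit. It illustrates the protocol described above rather than reproducing the Hydra code.

#include <stdatomic.h>

struct kernel_lock {
    atomic_int  lock_count;   /* holders plus waiters (the lock byte)        */
    atomic_int  sublock;      /* handoff count set by the unlocking process  */
    atomic_uint mask;         /* one bit per processor waiting on the lock   */
};

extern void wait_for_ipi(void);          /* models WAIT until an IPI arrives    */
extern void send_ipi_to(unsigned mask);  /* models the interprocessor interrupt */

void kernel_lock_acquire(struct kernel_lock *l, unsigned cpu)
{
    /* Indivisibly increment and test: an old value of 0 means the lock was free. */
    if (atomic_fetch_add(&l->lock_count, 1) == 0)
        return;                                   /* we hold the lock           */

    atomic_fetch_or(&l->mask, 1u << cpu);         /* record that we are waiting */
    for (;;) {
        wait_for_ipi();
        /* Indivisibly decrement and test the sublock; the processor whose
         * decrement reaches zero takes the lock, the others keep waiting.      */
        if (atomic_fetch_sub(&l->sublock, 1) == 1) {
            atomic_fetch_and(&l->mask, ~(1u << cpu));
            return;
        }
    }
}

void kernel_lock_release(struct kernel_lock *l)
{
    /* Indivisibly decrement and test: an old value of 1 means nobody is waiting. */
    if (atomic_fetch_sub(&l->lock_count, 1) == 1)
        return;

    atomic_store(&l->sublock, 1);                 /* let exactly one waiter in    */
    send_ipi_to(atomic_load(&l->mask));           /* wake every masked processor  */
}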
One disadvantage of locks is that a process which waits during the execution
of a kernel or spin lock does not relinquish the processor on which the process is
executing. Kernel locks have another disadvantage which involves the overhead
of invoking lock and unlock primitives. This overhead is minimized by storing the
code for the kernel lock primitives in the processor's local memory. In this case,
there is no memory contention and no contention for the lock. However, spin
locks have one major disadvantage over kernel locks: When a processor is spinning,
it accesses shared memory and thus consumes memory bandwidth which might
have been used more constructively.
Generally, when the probability of waiting is low, the lock primitives are very
efficient. A study performed on the C.mmp indicates that, although a process may
spend over 60 percent of its time executing kernel code, only about 10 percent of
accesses to locks cause locking. Moreover, the total fraction of time spent by processors waiting for locks is less than 1 percent. The study also shows that operations performed on data structures while they are locked are small. The overall average time spent in a critical section is about 300 μs.
Situations often arise in which lengthy blocking cannot be avoided in a system.
In such cases, using the lock primitive will result in excessive waste of resources.
For example, after a process issues an I/O request, a significant amount of time
may elapse before its completion. In other cases, relatively large sections of code
may require exclusive use of a data structure. In each case, when lengthy blocking
is possible, Hydra provides two types of semaphore mechanisms, kernel semaphore
(K-Sem) and policy semaphore (P-Sem). Both are implementations of the generalized
or counting semaphores. The main difference between the semaphore and lock
in the act of failure, and yet another failure often occurred soon after the system was restarted. Such restarts are fast, typically taking less than 2 min; hence relatively high availability was maintained. However, the loss of user jobs represents an extreme inconvenience. The suspect-monitor system's capabilities were improved by providing a system of processor-error counters. These counters record the occurrence of particular errors on each processor, and if they exceed a threshold for total errors or for a particular error class, the offending processor is amputated (removed from the configuration) or quiesced. The counters are also periodically right-shifted, causing them to decay over time so that they measure error frequency rather than simply providing an error count. A flaw in this scheme is that error counters are maintained for processors only. The processor that detects an error is charged with causing it, even if no concrete evidence to that effect exists; the error may actually have been caused by another processor, bad memory, or even by software. In practice, the high degree of symmetry among the processors makes it improbable that one processor will detect an error for which it is not responsible with a high enough frequency to exceed its error threshold.
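The decaying error counters are easy to picture in code. The sketch below uses hypothetical counter sizes and thresholds; it records errors per processor and periodically right-shifts every counter, which is what turns a raw error count into the error-frequency measure described above.

#include <stdbool.h>
#include <stdint.h>

#define NPROC        16
#define ERR_CLASSES   4     /* hypothetical number of error classes */
#define CLASS_LIMIT   8     /* hypothetical per-class threshold     */
#define TOTAL_LIMIT  16     /* hypothetical total-error threshold   */

static uint32_t err[NPROC][ERR_CLASSES];

/* Record one error of class c detected on processor p; return true if the
 * processor should be amputated or quiesced.                              */
bool record_error(unsigned p, unsigned c)
{
    uint32_t total = 0;

    err[p][c]++;
    for (unsigned i = 0; i < ERR_CLASSES; i++)
        total += err[p][i];

    return err[p][c] > CLASS_LIMIT || total > TOTAL_LIMIT;
}

/* Called periodically: right-shifting halves every counter, so old errors
 * lose weight and the counters track recent error frequency.              */
void decay_counters(void)
{
    for (unsigned p = 0; p < NPROC; p++)
        for (unsigned c = 0; c < ERR_CLASSES; c++)
            err[p][c] >>= 1;
}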
It was found that parity errors are the single most common failure mode. While
hard failures occur regularly, most parity failures are transient, suggesting that
perhaps error counters should be implemented for memory pages as well as for
processors. Although considerable emphasis was placed on error detection and
diagnosis in the C.mmp/Hydra, the recovery mechanisms are insufficient to
preserve integrity of smaller granules of computation.
To demonstrate the effect of memory contention in an execution environment
of the C.mmp, an experiment is described below. This experiment consists of
finding the root or zero of a function. The parallel algorithm used to solve this
problem was described in Section 8.4.2. There are two implementations of the
algorithm. In the first case, the code of the algorithm was stored in a single memory
page which was shared by all processes. The second implementation of the
algorithm provided separate pages of code for each process. Therefore, the first
implementation will encounter more memory conflicts. The experiments were
conducted on a C.mmp configuration consisting of Model 20 and 40 PDP-11s.
The Model 20 is typically 50 to 60 percent slower than the Model 40. Figure 9.4 illustrates the effect of memory contention on the performance of the two imple-
mentations of the root-finder algorithm. Notice that, beyond a certain threshold,
an increase in the number of processors will produce a negative effect on the per-
formance for the implementation with shared code page.
This algorithm was also used to study the performance of the various syn-
chronization primitives discussed above. Recall that the key feature of the parallel
algorithm is that it is synchronous. The nature of the parallel solution demands
the synchronization policy. Figure 9.5 shows the elapsed time required by the root-finder algorithm for varying numbers of slave processes and four different synchronization mechanisms. For these measurements, the function-evaluation time was distributed normally with a mean of 72 ms and a standard deviation of 18 ms. The parameter e refers to the wait-time constant for policy semaphores. The curve labeled PM0 corresponds roughly to the case where e = 0. While
Figure 9.4 Performance degradation due to memory contentions: elapsed time (in seconds) versus number of processes, with one curve labeled "Shared code page." (Courtesy of Oleinick, Carnegie-Mellon University, 1978.)
(Figure 9.5: elapsed time in seconds versus number of processes for the root-finder algorithm under different synchronization mechanisms, including the PM0 semaphore, the PM1 (e = 300) semaphore, and the spin lock.)
many factors affect the performance of this algorithm, the damaging effect of an
inappropriate synchronization mechanism is clearly demonstrated.
The system architecture of the S-1 multiprocessor system is presented below. The system is being developed under the auspices of the United States Navy. Described below are the basic organization and characteristics of the uniprocessor Mark IIA used in the S-1 construction. This processor is expected to have a performance level comparable to the Cray-1. The S-1 consists of 16 uniprocessors which share 16 memory banks via a crossbar switch. Each processor has a private cache. The software development to be presented includes the operating system for the S-1.
Figure 9.6 Logical structure of the S-1 Mark IIA multiprocessor: memory banks and uniprocessors 0 through 15 connected by the crossbar switch, with a diagnostic processor for each uniprocessor, I/O stores 0 through 7, and peripheral equipment. (Courtesy of S-1 Project at Lawrence Livermore National Laboratory, 1979.)
(a set of 16 words) in the physical memory. This tag identifies the unique member
uniprocessor (if any) which has been granted permission to retain (that is, own)
the block with write access. It also identifies all processors which own the line with
read access.
The memory controller allows multiple processors to own a line with read
access. However, it responds with a special error flag when a request is received to
grant read or write access for any block which is already owned with write access.
The special flag is also set when a request is received to grant write access for any
block which is already owned with read access. Any uniprocessor receiving such
an access denial is responsible for requesting other uniprocessors to flush or
purge the contested block from their private caches. It does this by using send and
receive messages via the interprocessor-interrupt mechanism within the crossbar
switch. The procedure outlined above thus dynamically maintains cache consistency.
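The ownership rules enforced by the memory controller amount to a small check on a per-block tag. The sketch below illustrates those rules (shared read ownership allowed, exclusive write ownership denied to others); the field names and the encoding are assumptions for the example, not the S-1 implementation.

#include <stdbool.h>
#include <stdint.h>

/* Sketch of the per-block ownership tag described above. */
struct block_tag {
    uint16_t readers;   /* one bit per uniprocessor owning the block read-only */
    int      writer;    /* uniprocessor owning it with write access, or -1     */
};

enum req { REQ_READ, REQ_WRITE };

/* Returns true if the request is granted.  A false return models the special
 * error flag: the requester must then ask the current owner(s) to flush or
 * purge the block from their private caches before retrying.                */
bool grant_access(struct block_tag *t, unsigned cpu, enum req r)
{
    if (t->writer >= 0 && t->writer != (int)cpu)
        return false;                        /* already owned with write access */

    if (r == REQ_READ) {
        t->readers |= 1u << cpu;             /* several read owners are allowed */
        return true;
    }

    /* A write request is refused while any other processor owns the block
     * with read access.                                                       */
    if (t->readers & ~(1u << cpu))
        return false;

    t->readers = 1u << cpu;
    t->writer  = (int)cpu;
    return true;
}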
The S-1 design provides a somewhat unconventional I/O subsystem which consists of many microcoded I/O channels. Each channel is managed by an I/O processor. The I/O subsystem also contains I/O buffers or memories which are accessible as part of the S-1 processor's address space. There is a 2K single-word buffer for each channel. These I/O memories are shared between an S-1 processor and an I/O processor. On output, data is placed into the I/O memory and then the I/O processor is signalled to transmit the data to the device. Input is handled similarly. These I/O memories are managed and assigned through the address-space management mechanism of the S-1 processor. Thus processes may perform I/O to devices if they have access to the I/O memory shared with that I/O processor. The S-1 architecture places few constraints on the I/O processor, which may be a commercially available minicomputer or specially designed hardware.
The I/O interconnection structure is designed to be simple and to possess some degree of fault tolerance. Each I/O peripheral processor may be connected to input-output ports on at least two uniprocessors, so that the failure of a single uniprocessor does not isolate any input-output device from the multiprocessor system. This fault-tolerance approach is used extensively in the design of the S-1 to achieve high reliability and availability. For reliability, all single-bit errors that occur in memory transactions are automatically corrected, and all double-bit errors are detected, regardless of whether the errors occur in the crossbar switch or in the memory system. The crossbar can be configured to keep a backup copy of every datum in memory so that the failure of any memory bank will not entail the loss of crucial data. System maintenance is facilitated by connecting a diagnostic computer to each uniprocessor, each crossbar switch, and each memory bank. This diagnostic computer can probe, report, and change the internal state of all modules that it monitors.
Each uniprocessor has a virtual address space of 27° thirty-six-bit words, uniformly
addressable in quarterwords, halfwords, singlewords, and doublewords. The
processor has 16 register sets for fast context switching and each register set has
32 general-purpose 36-bit registers. Two registers are used to maintain the pro-
cessor and user status. The virtual address space is segmented to promote modular sharing and separate access for each user or task. Variable size segments are
implemented and bounds checking is performed for reliability. The protection
mechanism used in each processor is similar to the ring protection system in the
Multics. Separate address spaces are allowed for each of the four rings which
provide concentric levels of privilege. Gates at each level provide the necessary
protection interface for procedure calls to the kernel.
Facilities are included to perform arithmetic and logical operations on various
data types. The data types include boolean, integer, floating-point with a set of
rounding modes, complex, vectors and matrices. The instruction set is optimized
to contain features for compilers and for operating system efficiency as well as for
arithmetic-intensive and real-time applications. In addition, special I/O instructions are provided to manipulate the contents of the I/O buffer. The interrupt
architecture consists of vectored interrupts with vector locations which can be
changed dynamically. Interrupts can be individually enabled or disabled and can
be programmed in eight priority levels. The processor priority is also able to be
reset. The uniprocessor has been designed to permit high-speed emulation of
general instruction set architectures.
The uniprocessor is designed especially to facilitate pipelined parallelism in
the fetching and decoding of instructions, the associated fetching of instruction
operands, and the eventual execution of instructions. The preparation and execu-
tion of instructions that specify both scalar and vector operations are pipelined.
Every instruction proceeds through multiple pipeline stages, including instruction
preparation, operand preparation, and execution. Figure 9.7 depicts the internal
logical structure of the S-1 Mark IIA uniprocessor. The processor consists of five major sections which are extremely fast, relatively special-purpose programmable controllers that operate in parallel to provide high performance.
Four sections that form the instruction pipeline are for instruction fetch (F sequencer), instruction decode (P sequencer), operand preparation (I sequencer), and arithmetic execution (A module). These sections are internally pipelined to achieve a maximum instruction-issue rate of one instruction per 50 ns, which is equivalent to a maximum data throughput rate of 720 million bytes/s. The maximum computation rate of the pipeline is 400 megaflops. The sequencers and the A module are heavily microcode controlled, with a total of 2.5 million control store bits and a total microword width of 996 bits. The microcode is a very low-level program that precisely specifies the operation of every pipeline stage.
Figure 9.8 shows the instruction unit pipeline diagram, which consists of 11 major
segments. Some stages of the pipeline, particularly those dealing with operand-
address arithmetic and instruction execution, necessarily have a wide variety of
functions, since the pipeline must process a wide variety of instructions. This
Figure 9.7 The internal logical structure of the S-1 Mark IIA uniprocessor: the instruction-fetch unit (F sequencer), instruction-decode unit (P sequencer), operand-preparation unit (I sequencer), and pipelined arithmetic module (A module), each with its writable control store, together with the instruction cache, data cache, user registers, and memory-interface unit connected to the diagnostic and maintenance processor and to memory or the crossbar. (Courtesy of S-1 Project at Lawrence Livermore National Laboratory, 1979.)
Figure 9.8 Instruction unit pipeline diagram in S-1, consisting of 11 major segments from instruction-cache read and microinstruction fetch through address translation, data-cache access, and arithmetic execution. (Courtesy of S-1 Project at Lawrence Livermore National Laboratory, 1979.)
Figure 9.9 Virtual-to-physical address translation in S-1: the virtual address (descriptor segmentito number, descriptor page number, and offset) indexes the descriptor segmentito table and descriptor page table to form a translated descriptor address (target segmentito number, target page number, and offset), which in turn indexes the target segmentito table (one page of the descriptor segment) and the target page table to yield the physical address. (Courtesy of S-1 Project at Lawrence Livermore National Laboratory, 1979.)
Table 9.2 Comparison of the expected performances of the S-1 Mark IIA uniprocessor, the Cray-1, and the CDC 7600: computation rate, megaflops.
nor the Cray-1 provides a low-precision floating-point format. The Cray-1 has only a 64-bit word format. For signal or real-time processing applications, the Mark IIA is expected to perform about four times better than the Cray-1. The S-1 Mark IIA uniprocessor is built out of ECL 100K medium-scale integrated circuits in performance-critical areas and ECL 10K circuits elsewhere. However, the S-1 is still not realized and the expected performance being reported may be rather optimistic.
intensive problems (e.g., physical simulation), and secure environments for data. It also supports full use of the S-1 architectural features by providing multiprocessor support, the management of a large, segmented address space, and exploitation of reliability features. The Amber O/S combines functions of the file system and virtual memory. The file directory structure is hierarchical and tree structured. Files are represented as segments. Segmentation facilitates dynamic linking. A demand-paging policy is used to copy pages directly between disk records and main memory. Page replacement works globally on all of main memory and uses an approximate least-recently-used algorithm for eviction of pages. The LRU policy is not always optimal for some applications, such as real-time applications. In such cases, other placement and replacement algorithms are used.
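Approximate LRU is usually realized with reference bits rather than exact timestamps. The sketch below shows one common approximation, the clock (second-chance) algorithm; it is offered only as an illustration of what approximate least-recently-used eviction can look like, not as the Amber implementation.

#include <stdbool.h>
#include <stddef.h>

#define NFRAMES 1024            /* hypothetical number of page frames */

struct frame {
    bool referenced;            /* set whenever the page is used      */
    bool in_use;
};

static struct frame frames[NFRAMES];
static size_t hand;             /* the clock hand */

/* Pick a frame to evict: recently referenced frames get a second chance
 * (their reference bit is cleared); the first frame found with a clear
 * reference bit is the victim.                                          */
size_t clock_evict(void)
{
    for (;;) {
        struct frame *f = &frames[hand];
        size_t victim = hand;

        hand = (hand + 1) % NFRAMES;
        if (!f->in_use)
            return victim;              /* a free frame can be used directly */
        if (f->referenced)
            f->referenced = false;      /* second chance                     */
        else
            return victim;              /* not referenced recently: evict    */
    }
}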
The Amber O/S supports multitasking by the division of problems into co-
operating tasks. It also provides low- and high-level scheduling features. The low-
level features provide simple mechanisms for real-time applications. Examples are
priority scheduling with round-robin queues, dedicated processor assignments,
and interrupt processing. The high-level scheduler may implement complex
features such as resource allocation and load balancing on multiprocessor
configurations. Interprocess communication techniques such as message channels
are provided. Other synchronization techniques supported are software interrupts
and event notifications. Time-outs on event waits are also implemented. The
Amber also possesses features to enhance availability and maintainability. Time-
outs on all waits and suspension of processes are performed to prevent deadlock
situations. Monitor tasks run concurrently with user and system tasks to detect
hardware malfunctions.
Figure 9.10 The architecture of a typical HEP system with four processors: the processors and data memories are connected by a packet-switched network, with an I/O cache, I/O channel control, other I/O devices, and mass storage devices attached. (Courtesy of Denelcor, Inc., 1982.)
section. All instructions and data words in the HEP are 64 bits wide, although data references within the PEM can access halfwords, quarterwords, and bytes.
Figure 9.11 The mass storage system (MSS) in the HEP: the I/O cache memory connects the switch network to the I/O channels, which serve disk storage modules, magnetic tape, channel-to-channel interfaces, and special-purpose I/O through their controllers. (Courtesy of Denelcor, Inc., 1982.)
Figure 9.12 Functional components in the MSS of the HEP, including the I/O cache assembly, switch and memory interfaces, channel and loop interfaces, a control computer, and the disk drives. (Courtesy of Denelcor, Inc., 1982.)
a word for a channel every 100 ns. Cache memory request messages are received
from the switch network through the switch interface. This interface is coupled to
a switch node and can service a memory request from the switch every 100 ns.
Figure 9.13 Switching node in the HEP's interconnection network: three input ports (A, B, and C) feed the routing logic and routing control, which drive the three output ports. (Courtesy of Denelcor, Inc., 1982.)
the distance from each message to its addressed destination is reduced. This is
accomplished by incorporating within each node three routing tables (one per
port), which are loaded when the system is configured. Thus, each switch node is
programmed to know the best output port routing to the final destination. Such
programmed routing techniques allow for alternate routing to bypass a faulty
component. In practice, the actual routing is determined by the best routing path
and by priority in the case of conflicts. The priority is implemented by the use of
age counters which increment with each nonoptimal routing.
A unique feature of the switching network is that the switch nodes do not
enqueue packets. These packets are routed by the switch nodes every 50 ns regard-
less of port contention. The modularity of the switching network permits field
expandability. The increased memory access times that result from the greater
physical distances between the PEMs and the DMMs can be compensated for in two ways. Each PEM contains a local memory large enough to buffer most of the program codes. Since each switch node is pipelined, each processor can execute a large number of instruction streams concurrently.
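The per-node routing decision can be pictured as a table lookup plus an age-based priority rule. The sketch below is a simplified model of the behavior described above, with hypothetical names and sizes: every packet is forwarded on some port each cycle, the packet with the largest age counter gets its preferred port in a conflict, and a misrouted packet has its age counter incremented.

#include <stdint.h>

#define NPORTS 3                  /* ports A, B, and C                      */
#define NDEST 64                  /* hypothetical number of destinations    */

struct packet {
    int      valid;               /* 1 if a packet arrived on this port     */
    unsigned dest;                /* final destination address              */
    unsigned age;                 /* incremented on every nonoptimal route  */
};

/* route[p][d]: preferred output port for destination d, as seen from input
 * port p.  The tables are loaded when the system is configured.            */
extern unsigned route[NPORTS][NDEST];

/* Assign an output port to every incoming packet for one 50-ns cycle.
 * Packets are never queued; losers of a port conflict are misrouted.       */
void route_cycle(struct packet in[NPORTS], struct packet out[NPORTS])
{
    int taken[NPORTS] = {0};

    for (int p = 0; p < NPORTS; p++)
        out[p].valid = 0;

    for (int pass = 0; pass < NPORTS; pass++) {
        int best = -1;                         /* oldest unplaced packet    */
        for (int p = 0; p < NPORTS; p++)
            if (in[p].valid && (best < 0 || in[p].age > in[best].age))
                best = p;
        if (best < 0)
            break;

        unsigned want = route[best][in[best].dest];
        if (!taken[want]) {                    /* preferred port is free    */
            out[want] = in[best];
            taken[want] = 1;
        } else {                               /* misroute to any free port */
            for (int q = 0; q < NPORTS; q++)
                if (!taken[q]) {
                    out[q] = in[best];
                    out[q].age++;              /* nonoptimal routing        */
                    taken[q] = 1;
                    break;
                }
        }
        in[best].valid = 0;                    /* this packet is placed     */
    }
}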
Figure 9.14 Achieving maximal parallelism with replicated hardware in the HEP: replicated branch, divide, multiply, and add/ALU units support MIMD processing. (Courtesy of Denelcor, Inc., 1982.)
Figure 9.15 Functional description of HEP's process execution module: program memory, register memory, and constant memory feed the control units and task queues of the instruction processing unit (IPU), whose function units (including the create function and scheduler units) connect to the switch. (Courtesy of Denelcor, Inc., 1982.)
allocated to a PEM are buffered in the program memory. These instructions are
fetched from the program memory every 100 ns with concurrent decoding and
execution of previously fetched instructions, as depicted in Figure 9.16. Up to 50
instructions may be in various stages of execution operating on one or more data
streams simultaneously. However, the instruction fetch unit does not seem to
permit simultaneity in instruction fetches and decodes. Also, there is only one
instruction fetch unit in the PEM. For these reasons, the performance of the HEP
processor may be limited to one instruction cycle per 100 ns. This may subsequently
limit the effective utilization of the functional units.
The IPU in each PEM includes 2048 interchangeable general-purpose registers,
as well as constant memory and function units. The constant memory is used to
store user program constants and is read-only by user programs. The 4096 locations
in the constant memory eliminate the need for data memory accesses for program
constants. The function units implement the HEP instruction set, which includes
extensions used to coordinate MIMD processing. In addition, the IPU has pro-
visions for four expansion function units. These units may be used for custom or
special-purpose instructions at the user’s option.
In the HEP system, a set of cooperating processes constitute a task. Tasks and
processes can be of two types: user or supervisor. The execution environment of a
task is its task domain, which is defined by a 64-bit task status word (TSW). The TSW provides protection and relocation information for each task by specifying a partition of the program, constant, register, and data memories into areas.
(Figure 9.16: the process queue drives instruction fetch from program memory and operand fetch from the registers; operands are passed to the functional units (multiply, add, divide, and others), with increment control closing the loop.)
This information is encoded as the program base, program limit, data base, data
limit, constant base, register base, and register limit. All virtual addressing to
operands is relative to the base addresses in the memory area in which they are
stored. A task status register is used to hold the TSW for each task domain. A
16-entry task queue, where each entry contains a unique TSW, is used to imple-
ment a simple first-in first-out discipline in deciding which ready-to-run task to
schedule. The task queue is equally divided for user and supervisor tasks.
In addition to the TSW, there is a process status word (PSW), which contains a 20-bit program counter and other state information for a HEP process. Each PSW points to an instruction that is ready for execution. There is a process tag (PT) in the task queue for each PSW that points to an instruction that is ready for execution. When a task is first initiated, it has only one PSW; that is, one process. The software creates additional PSWs as new processes are created to initiate parallel processing within a task. There is a PSW queue which can hold a total of 128 PSWs: 64 for user processes and 64 for supervisor processes.
These PSWs in the process queue circulate in a control loop which includes an incrementer and a pipeline delay. The delay is such that a particular PSW cannot circulate around the control loop any faster than data can circulate around the data loop consisting of general-purpose registers and the function units. As the program counter in a circulating PSW increments to point to successive instructions in program memory, the function units are able to complete each instruction in time to allow the next instruction for that PSW to be influenced by its effects. The control and data loops are pipelined in eight 100-ns segments, so that as long as at least eight PSWs are in the control loop, the processor executes 10 MIPS. However, a particular process cannot execute faster than 1.25 MIPS, and will execute at a lesser rate if more than eight PSWs are in the control loop.
The instruction issuing operation maintains a fair allocation of resources between tasks first and between processes within a task second. The main scheduler contains 16 task queues, each containing up to 64 PTs. A secondary queue called the snapshot queue records the head PT of each task queue each time the snapshot queue becomes empty. PTs arriving one at a time from the snapshot queue cause the issuing of an instruction from the corresponding process into the execution pipeline.
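The two-level fairness rule, tasks first and processes within a task second, can be modeled with a few queues. The sketch below returns the next process tag to issue; the names and sizes are hypothetical, and the model only illustrates the snapshot discipline described above.

#include <stddef.h>

#define NTASKS 16
#define NPROCS 64                 /* process tags per task queue */

struct task_queue {
    int    pt[NPROCS];            /* process tags, kept as a circular FIFO */
    size_t head, count;
};

struct scheduler {
    struct task_queue task[NTASKS];
    int    snap[NTASKS];          /* snapshot queue: one entry per task    */
    size_t snap_head, snap_count;
};

/* Return the next process tag to issue, or -1 if nothing is runnable.
 * An empty snapshot queue is refilled with one entry for every nonempty
 * task queue (equal service among tasks); the issued tag is then rotated
 * to the tail of its own task queue (equal service among processes).      */
int next_pt(struct scheduler *s)
{
    if (s->snap_count == 0) {                 /* take a new snapshot        */
        s->snap_head = 0;
        for (int t = 0; t < NTASKS; t++)
            if (s->task[t].count > 0)
                s->snap[s->snap_count++] = t;
        if (s->snap_count == 0)
            return -1;
    }

    int t = s->snap[s->snap_head];
    s->snap_head++;
    s->snap_count--;

    struct task_queue *q = &s->task[t];
    int pt = q->pt[q->head];                  /* head process tag of task t */
    q->head = (q->head + 1) % NPROCS;         /* rotate it to the tail      */
    q->pt[(q->head + q->count - 1) % NPROCS] = pt;

    return pt;
}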
A control unit cooperates with the function units to execute instructions in
the IPU. The control unit selects an instruction for execution from one of the task
queues, fetches the instruction, addresses the operands, and passes the instruction
operation code and the operands to one of the function units to perform the
specified operation. There are two types of function units: synchronous and asynchronous. The synchronous function units are pipelined with eight linear segments and a segment time of 100 ns. Thus, instructions are completed in 800 ns. Examples of synchronous function units are the floating-point adder (+), the multiplier function (*), the integer function unit (FU), the create function unit (CFU), the hardware access (HA) unit, and the system performance instrument (SPI).
The CFU performs all operations affecting the PSWs. This includes activating and terminating processes, incrementing the program counter in a PSW that has had an instruction executed, and executing branch and supervisor call instructions. The HA executes all instructions to read or write program memory and performs bit encode and decode operations. The SPI collects data for performance measurement and monitoring counters and tracks the number of instructions executed by tasks. This allows billing the user for the amount of work done regardless of the time required because of overheads.
Asynchronous function units do not necessarily complete their operations within 800 ns. Examples of such function units are the divider (÷) and the scheduler (SFU). The divide function unit consists of up to eight individual divider modules which asynchronously execute 64-bit floating-point divide instructions. Divide instructions are initiated at a rate of one every 100 ns until all divider modules are busy. Each module can execute a divide instruction every 1700 ns. This is the only function unit in the IPU that is not pipelined. The SFU is both synchronous and asynchronous, and executes all instructions involving data transfers between register memory and data memory. Transfers that pass through the switch are executed asynchronously; all others are executed synchronously. The SFU can accept a new instruction every 100 ns.
of this kind of operation, the third access state, reserved, is implemented in the registers. The destination register is set reserved when the source data is sent to the function units, and only when the function unit stores the result is the destination set full. No instruction can successfully execute if any of the registers it uses is reserved.
A process failing to execute an instruction because of an improper register access state is merely reinserted in the queue with an unincremented program counter so that it will reattempt the instruction on its next turn for execution. A program executing a load or store instruction that fails because of an improper data-memory access state is reinserted in the SFU queue and generates a new switch message on its next attempt.
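The full/empty/reserved discipline on registers amounts to a small state check made at issue time. The sketch below models one register cell; the names and the particular rule used (source operands must be full, the destination must not be reserved) are assumptions chosen only to illustrate the retry behavior described above.

#include <stdbool.h>
#include <stdint.h>

/* Access states of a register cell, as described above. */
enum reg_state { EMPTY, FULL, RESERVED };

struct reg_cell {
    enum reg_state state;
    uint64_t       value;
};

/* Try to issue a two-operand instruction "dst = op(src1, src2)".  A false
 * return models the failed access check: the process is reinserted in the
 * queue with an unincremented program counter and retries on its next turn. */
bool try_issue(struct reg_cell *dst, struct reg_cell *src1,
               struct reg_cell *src2)
{
    if (src1->state != FULL || src2->state != FULL)
        return false;           /* an operand has not been produced yet      */
    if (dst->state == RESERVED)
        return false;           /* result of an earlier operation is pending */

    dst->state = RESERVED;      /* reserved until the function unit finishes */
    /* ... the operands are sent to a function unit here ...                 */
    return true;
}

/* Called when the function unit stores its result into the destination. */
void complete(struct reg_cell *dst, uint64_t result)
{
    dst->value = result;
    dst->state = FULL;
}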
routine was CREATEd, a RESUME has no effect. On the other hand, a RETURN
causes the termination of the process if it was CREATEd or if it previously executed
a RESUME.
HEP Fortran generates fully reentrant code and dynamically allocates
registers and local variables in data memory as required by the program. Hence, it
is easy to create several processes which simultaneously execute identical programs
on different data. This can be accomplished by placing a CREATE statement in
a loop so that several parallel processes will execute identical programs on different
local data. An example of the implementation of parallel operations in the HEP
is given below.
Example 9.1

      PURGE $IP, $NP
      $NP = NPROCS
      DO 10 I = 2, NPROCS
      $IP = I - 1
      CREATE S($IP, $NP)
   10 CONTINUE
      $IP = NPROCS
      CALL S($IP, $NP)
C     WAIT FOR ALL PROCESSES TO FINISH
   20 N = $NP
      $NP = N
      IF (N .NE. 0) GO TO 20
      SUBROUTINE S($IP, $NP)
      MYNUM = $IP
      $NP = $NP - 1
      RETURN
      END
      L = 270
      N = 20
      PURGE $E, $IN, $IW
      $E = 0.0
      $IN = 1
      $IW = 0
      DO 100 I = 1, N
      CREATE DOALL ($IN, $IW, L)
  100 CONTINUE
  200 IF (VALUE ($IW).LT.L) GO TO 200
      SUBROUTINE DOALL ($IN, $IW, L)
      COMMON A(270), B(270), C(270), D(270), $E
    1 I = $IN
      $IN = I + 1
      IF (I.GT.L) GO TO 30
      A(I) = A(I)**SIN(B(I))
      IF (SIN(A(I)).GT.COS(C(I))) GO TO 10
      A(I) = A(I) + C(I)
      GO TO 20
   10 A(I) = A(I) - D(I)
   20 $E = $E + A(I)**2
      $IW = $IW + 1
      GO TO 1
   30 RETURN
(b) Parallel version.
Figure 9.17 Algorithm restructuring example in using HEP for parallel processing.
Example 9.2 The serial code shown in Figure 9.17a manipulates the linear array A with 270 elements and accumulates the square of the components of A in a variable E. The asynchronous variables $E, $IN, and $IW are introduced in the parallel version of the program shown in Figure 9.17b. $E is used to accumulate, under mutual exclusion, the squares of the A(I)'s when they are updated. The parallel version creates 20 processes so that the granularity is significant enough to have nontrivial processes. $IN is used to control accesses to unique components of A, and $IW is used to terminate the parallelism when all components of A have been updated and $E computed.
      END
      SUBROUTINE T ($START, $DONE, $ALLDONE)
   10 I = I + 1
      IF (I.GE.400) GO TO 99
      $DONE = $DONE + 1
      GO TO 10
   99 IF (VALUE ($DONE).EQ.400) $ALLDONE = 1
      RETURN
      END
(Figure: an IBM 370/168 uniprocessor configuration: processor storage (PS), the storage controller (SC), and the 370/168 CPU.)
trolled by the storage controller. There is only one central processing unit (CPU)
which contains the pipelined instruction decede and execution units together
with a fast cache. Multiple L/O channels can be connected to the CPU. Each
channel is a specialized I/O processor with a simple I/O instruction set and operates
asynchronously with the CPU. An AP configuration for the IBM 370/168 is
illustrated in Figure 9.19. It is extended from the uniprocessor configuration by
attaching an attached processing unit (APU).
The APU is almost identical to the standard 370/168 CPU except no I/O
channels can be attached to it. Both the CPU and APU have their own caches.
The cache coherence problem is resolved by using the cache invalidate (CI) lines
between the CPU and APU. The interprocessor communication (IPC) lines are
used for exchanging information or interrupt signals directed between the two
processors. The structure is considered asymmetric because the APU is devoted
exclusively to computation and the CPU handles both internal computation and
I/O communications. The multisystem control unit (MCU) performs the inter-
connection switching functions between the processors and the shared memory
modules.
An MP configuration of the IBM 370/168 is shown in Figure 9.20. Instead of
attaching an APU. another CPU is used to form a symmetric dual-processor
system. This MP configuration is composed of two 370/168 uniprocessor systems
with shared memories. The two CPUs have equal capabilities. Two sets of I/O
Figure 9.19 An IBM 370/168 AP configuration. (Courtesy of International Business Machines Corp., 1978.)
Figure 9.20 An IBM 370/168 MP configuration. (Courtesy of International Business Machines Corp., 1978.)
channels attached to each CPU are mutually exclusive and cannot communicate
with each other directly. The MCU provides the necessary interconnection hard-
ware between the two CPUs and shared memories. It also contains a configuration control panel for the purpose of manual systems reconfiguration.
In the IBM 370/158 MP configuration, two processors share from 1 to 8
million bytes of main storage. Each processor has a separate 8K-byte cache with
230-ns access time of 8 bytes. The two processors in the 370/168 MP share from
2 to 16 million bytes of main storage. Each CPU has either an 8K-byte or 16K-
byte cache with a reduced 80-ns access time of 8 bytes. The model 158 has 10
block multiplexor channels, while the model 168 can have up to 22 block multiplex-
or channels. The block multiplexor channels permit concurrent processing of
multiple channel programs for various speed peripheral devices, as was illustrated
in Section 2.5. The Model 168 is enhanced from the Model 158 mainly in the area
of memory and I/O subsystems. Their CPUs essentially have the same capabilities.
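As a rough comparison (an illustrative calculation based only on the access times quoted above, not a figure from the original manuals), the cache access bandwidths implied by these numbers are about 8 bytes/230 ns ≈ 35 million bytes/s for the Model 158 and 8 bytes/80 ns = 100 million bytes/s for the Model 168.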
The 370/168 MP configuration is considered loosely coupled because two
separate copies of operating systems are running in the two CPUs. An IBM 370/168
uniprocessor system can also be reconfigured to a tightly coupled multiprocessing
system of dual CPUs with shared memories and shared 1/O devices, as shown in
Figure 9.21. The two CPUs are tightly coupled by a single copy of the operating
system in the shared memory. A tightly coupled CPU pair can also be loosely
coupled with another uniprocessor CPU to form a mixed multiprocessor system,
as demonstrated in Figure 9.22. This is really a tightly coupled multiprocessor in a
loosely coupled configuration. The tightly coupled dual CPU and the uniprocessor
share some direct-access devices, such as disk, and some tape units. A channel-to-
channel adaptor can be used to link the two CPU modules. Each CPU module still
has some private channels connecting to some private I/O and secondary devices,
like the 3330 disk storage subsystems.
The IBM 3033 system The 3033 multiprocessor complex consists of two 3033
Model M processors, two 3036 consoles, and the 3038 multiprocessor communica-
tions unit (MCU). Figure 9.23 shows a conceptual relationship between the MCU
and processor functions. The 3033 attached processor complex consists of a 3033
Model A processor, the 3042 attached processor, two 3036 consoles, and the
3038 MCU.
The MCU for the multiprocessor-attached processor models provides
prefixing, interprocessor communication, cache (high-speed buffer) and storage
update communication, sharing of processor storage, configuration-partitioning
control, synchronization facilities, and communication of changes to the storage
protection keys.
The MCU also enables both processors in an MP configuration to access all
of processor storage while retaining the overlap capability in storage operations
permitted by eight-way interleaving. This means that both processors can have
concurrent storage operations in progress with a varying degree of overlap depend-
ing upon the particular sequence of LSU accesses. The configuration and parti-
tioning control in an MP system provides a variety of storage configuration options
Figure 9.21 A tightly coupled IBM dual-processor system with a single copy of the operating system residing in the shared memory.
Figure 9.22 A tightly coupled IBM dual-processor system loosely coupled with an IBM uniprocessor.
[Figure 9.23: The IBM 3033 multiprocessor complex, with the IBM 3038 MCU between two IBM 3033 Model M processors, each with its own channels and devices.]
which can apportion the storage either independently to each processor for uni-
processor mode or shared between the two processors for multiprocessor mode.
The 3033 MP complex achieves its high performance through faster interprocessor communication, better cache and storage-protection communication, and faster access to shared processor storage than the 370/168 MP.
The 3033 MP contains another improvement over the 168 MP and AP, in
that the priority bit mechanism has two levels. Low-level bits compete with low-
level bits and high-level bits compete with high-level bits but, as might be suspected
from the names, a high-level bit can preempt a low-level bit. The 3033 with its
enhanced buffering can, at times, sustain an exceptionally long string of storage
requests, so this facility prevents a high-priority request on one processor from
being unnecessarily delayed by low-priority activity on the other. The 3033 MP
performance has been stated to be 1.6 to 1.8 times that of the 3033 uniprocessor
system, based on simulation results and running experience.
The IBM 3081 system The IBM 3081 processor unit has a symmetric organization of two central processors, each with a 26-ns machine cycle time, and executes IBM System/370 instructions at approximately twice the rate of the IBM 3033.
One of the goals set in the design of the 3081 was a better price/performance index
over the System/370 and 303x series. The other goal was upward compatibility
with those product lines. Furthermore, it was designed to have improved reliability,
availability, and serviceability through new technology and partitioning and
packaging schemes. One of the major achievements in packaging is the develop-
ment of a field replaceable unit called a thermal conduction module (TCM), which
contains up to 130 IC chips on one substrate. With these TCMs, a large board
called a clark board was developed. Each board contains up to either six or nine
TCMs, which made it possible to package the entire processor unit on four boards
in one frame.
The 3081 processor unit organization shown in Figure 9.24 consists of five
subsystems: two central processors, the system controller, the main storage, and
the external data controller (EXDC). The 3081 is called a dyadic processor since
it is configured as two identical processors that share a system controller and
EXDC within one frame. Furthermore, they act as a tightly coupled multiprocessor
which cannot be decoupled to act as two independent uniprocessors as in the
IBM 3033. The configuration is symmetrical because each processor has the same
priority and operational characteristics with respect to the central storage and
channels. Each processor has access to channels and to central storage via the
controller.
The processors share main storage, which could be 16, 24, or 32 megabytes. A
segment of main memory called the system area is reserved for microcode. This
area also contains unit control words (UCWs) for I/O devices and system tables
and directories. Hence, the system area is not accessible to user programs. The
main storage is organized as a card-on-board package and is two-way interleaved.
Each board, which contains 4M bytes of main memory, is called the basic storage
module. This module is configured so that a block (128 bytes) of data can be
accessed with a single operation to efficiently transfer a block between the processor
and memory. The interleaving scheme uses the doubleword, which is the basic unit of memory operation, and a 2K-byte address interleave across the 4M-byte modules. Each 2K-byte segment, which was chosen to minimize the complexity of memory reconfiguration, can thus be independently accessed.
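To put the quoted numbers together (an illustrative arithmetic check, not taken from the original text): a 128-byte block corresponds to 128/8 = 16 doublewords, the basic unit of memory operation, and each 4M-byte basic storage module contains 4M/2K = 2048 of the independently accessible 2K-byte segments.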
The system controller provides the paths and controls the communications
between the main memory and other subsystems. The basic data-bus width of all
units connecting to the controller is 8 bytes with a data transfer rate of 8
bytes per machine cycle. The bus is bidirectional. The controller also contains the
storage protect keys and time-of-day clock and manages an eight-position queue
containing storage requests. The 3081 can support up to 24 channels, which can
be of either the byte-multiplexer or block-multiplexer type. Channels are con-
trolled by the EXDC. The EXDC consists of two types of microcode-controlled
elements. One of the elements handles the control of I/O instructions and interrupts. The other handles the data control sequencing and provides buffering for each group of eight channels.
Each processor consists of five functional elements, as shown in Figure 9.25,
and packaged within a nine-TCM board. The processor is not a pipelined processor
as in the IBM System 370/168. However, it has an effective instruction prefetching
capability. Each processor has three separate execution elements, a buffer control element, and a control store element. The execution elements are the instruction element, the variable-field element, and the execution element. The instruction element controls the instruction sequencing of a processor by initiating requests for instructions and attempting to maintain an instruction buffer of four doublewords locally.
It performs the instruction-decode and operand-address generation functions
and initiates all requests for operands. It also executes all arithmetic and logical
operations.
The variable-field element operates under horizontal microcode control and executes all variable-length, storage-to-storage instructions. Within it are a decimal adder and its associated input and output regions. In executing the specified set
Figure 9.25 A central processor of the IBM 3081, with the variable-field element (VFE), instruction element (IE), and execution element (EE) connected to the system controller. (Courtesy of International Business Machines Corp., 1983.)
Architectural evolution of Univac 1100 series The 1100 series hardware architecture
is based on a 36-bit word, one’s complement structure which obtains one operand
from storage and one from a high-speed register, or two operands from high-speed
registers. The 1100 Operating System is designed to support a symmetrical
multiprocessor configuration simultaneously providing multiprogrammed batch,
[Figure: Architectural evolution of the Univac 1100 series, from the 1107 (Class 1) and 1108 (Class 2) through the 1110 (Class 3) to the 1100/40, 1100/80, and 1100/90 (Class 4).]
Figure 9.27 The architecture of the Univac 1108 multiprocessor: an availability control unit (ACU), three central processors each with 16 I/O channels and a console, and two I/O controllers (IOCs). (Courtesy of Sperry Rand Corp., 1965.)
and then, without allowing any other processor to access the same memory word,
to set the semaphore bit. If the semaphore was initially set, an interrupt occurs
(indicating that the item protected by this semaphore is already in use). At this
point, the interrupted process is queued until the semaphore is cleared. If the
semaphore was initially clear, the next instruction is executed. Execution of the
TEST-AND-SET instruction must precede the use of any data where erroneous
results could be produced by two or more instruction streams operating on these
data concurrently.
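A minimal sketch of the semantics just described is given below. It is only an illustration: the routine name TSET is hypothetical, and plain Fortran cannot make the test and the set indivisible, whereas the 1108 TEST-AND-SET performs both in a single atomic storage reference.

C     Sketch of TEST-AND-SET semantics (illustrative only; not atomic).
      LOGICAL FUNCTION TSET (SEM)
      INTEGER SEM
C     SEM = 1 means the item protected by the semaphore is in use.
      IF (SEM .EQ. 1) THEN
C        Semaphore already set: the interrupted process would be
C        queued until the semaphore is cleared.
         TSET = .TRUE.
      ELSE
C        Semaphore clear: set it and let the caller proceed.
         SEM = 1
         TSET = .FALSE.
      END IF
      RETURN
      END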
The introduction of the multiprocessor version of the 1108 led to the develop-
ment of a new kind of system component called the availability control unit (ACU).
This unit allows partitioning of the system into three smaller independent systems
for debugging of either hardware or software on one system, while normal opera-
tion (at reduced throughput) continues on the remainder of the system. Each
processor periodically sends a signal to the ACU indicating that the processor is
still functioning and the executive is still in control. If the ACU does not receive all
The Univac 1100/80 systems The 1100/80 performs an add instruction in only 200
ns. Important features of the 1100/80 architecture and design approaches are
listed below:
Figure 9.28 A single-processor Univac 1100/80 configuration. (Courtesy of Comm. of ACM, Borgerson et al., 1978.)
interface unit. There is still only one cache, which is common to both processors
and located in the storage interface unit.
Figure 9.30 depicts a 4 x 4 system. A second storage interface unit with its
own independent cache is now present and connected to the two additional
processors and I/O units. The two storage interface units have a cache invalidate
interface which ensures that if both caches contain copies of the same data, altering
the copy in one cache will cause the corresponding copy in the other to be marked
as invalid.
Main memory is a common resource for all processors and I/O units and is
accessed by them via the corresponding storage interface units. There can be up
to four main storage units, each containing from 512K to 1M words of memory.
Each main storage unit is connected to both storage interface units and can be
two-way interleaved. Processors are connected to each other by interprocessor
interrupt interfaces, which permit a processor to cause an interrupt in any other
processor. An I/O unit is electrically connected to only one storage interface unit
and to the processors on that storage interface unit. As a result, a processor can
handle I/O only on I/O units connected to the same storage interface unit as itself.
The central processor of an 1100/80 has a 36-bit word length and a reasonably
rich repertoire of fixed-point, floating-point, data-movement, and character-
manipulation instructions. The architecture is essentially register-oriented, with
separate index registers and accumulators. Most double-operand instructions
have one operand in a register and one in memory. Central to the architecture of
this system is a set of 128 words called the general register set (GRS). Programs
can address 16 index registers and 16 accumulators.
The 1100/80 is installed with a new group of instructions to accelerate common
functions for both users and the executive. These include several context-switching
instructions such as save and restore system status and load and store GRS, and
user-oriented instructions, including new constant storage and memory increment
Figure 9.30 A four-processor Univac 1100/80 configuration. (Courtesy of Comm. of ACM, Borgerson et al., 1978.)
and decrement instructions. Two new instructions were also added to support the
autorecovery feature of the 1100/80. These instructions reset the autorecovery
timer and toggle the autorecovery path. When autorecovery is enabled and the
system software does not reset the automatic recovery timer within the preset
time interval, the system transition unit (similar to the ACU of the 1108) clears,
reloads, and reinitiates the system. Two recovery paths are provided. The alterna-
tive recovery path is system initiated when an attempted automatic recovery fails.
The instructions mentioned above provide for software resetting of the automatic recovery timer and for selection of the first automatic recovery path to be used by
the next recovery attempt.
The 1100/80 introduced instructions to aid address-space manipulation.
The most significant new instruction transfers a two-word segment descriptor
directly from the segment descriptor table to the segment descriptor register, saves
the previous contents of the segment descriptor register, and branches. The
granularity of segment sizes has been improved on the 1100/80. Segments can be
as large as 262K words and can be specified in 64-word granules beginning on any
512-word boundary and ending on any 64-word boundary.
Input-output channels on the 1100/80 are available in two forms. Word
channels are available that are compatible with the 1110 system. Additionally,
intelligent byte channels are available that allow the direct usage of byte-oriented
peripheral equipment. The 1100/80 uses a high-speed cache memory between the processor and main storage. The cache memory is transparent to the user. It is
constructed of emitter-coupled logic storage elements and contains up to 16K
words; these words are the most recently used contents of main storage. The
physical main storage capacity was increased to a maximum of 4M words of MOS
memory. Single-bit error correction and double-bit error detection are provided.
The Univac 1100/90 systems The Univac 1100/90 multiprocessors are the most recent systems by Sperry Univac. The systems permit one, two, three, or four central processing units (CPUs) as 1100/91, 1100/92, 1100/93, and 1100/94 systems, respectively. The 1100/9x is an x-by-x system containing x CPUs and x I/O processors, which can be tightly coupled. Figure 9.31 shows an example of a four-by-four system. However, loosely coupled configurations are also possible in which there
are two independent systems sharing one mass storage subsystem.
The 1100/94 system configuration, in addition to having four CPUs and four I/O processors, contains four main storage units (MSUs) and two system support processors. Each CPU is pipelined, with an 8K-word instruction cache and an 8K-word data cache. A word is 36 bits wide. Each cache is organized into 256 sets with four blocks per set. Each block contains eight words. The CPU uses a virtual addressing scheme with 2^36 words of address space. The virtual address is divided into four portions. A segmentation scheme is used with a maximum of 262,144 segments. A write-through policy is used to update the MSU. On a write to shared data in a local cache, all caches in other CPUs containing a copy of the block are invalidated.
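These figures are mutually consistent (an illustrative check, not from the original text): 256 sets x 4 blocks per set x 8 words per block = 8192 words, i.e., the stated 8K-word size of each cache; and 262,144 segments is 2^18, so if each segment can reach the 262,144-word (2^18) maximum quoted earlier for the 1100/80, the virtual space spans 2^18 x 2^18 = 2^36 words.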
Each MSU contains four independent banks. A block read is a single reference
resulting in four doubleword transfers with a 600-ns cycle time. A doubleword read is accomplished in 360 ns. The 1100/90 systems use the same system software as
the 1100/80 systems for upward compatibility.
Operating system features in Univac 1100 series The operating system structure
for the 1100 series consists of multiple layers of software, as shown in Figure 9.32.
The structure of the kernel of the operating system is discussed here. The 1100
series executive system is called EXEC. A user’s request to EXEC is made by
executing a software interrupt instruction called the executive request interrupt (ER). Execution of this instruction in a processor causes a transfer of control to the executive. The EXEC has an input-output control routine concept called a symbiont (spooling routine). These routines overlap read, print, and punch operations with program execution. The EXEC also possesses multiprogramming
capabilities designed to operate in both a multiprogram and multiprocessor
Figure 9.31 The Univac 1100/94 system with four processors. (Courtesy of Sperry Rand Corp., 1983.)
[Figure 9.32: Layers of the Univac 1100 series operating system: the 1100 Executive system at the center, surrounded by compilers, the data management system, the communications management system and transaction interface package, application packages, the batch/demand control language, and end-user facilities.]
The origin of a program is in symbolic elements within the run stream. These
elements are then compiled to form relocatable elements which are collected
(bound) with other relocatable elements to form an absolute element (the program).
The term “absolute” refers to the program relative-address solution only; the
relative-addressing capability of the hardware allows the program to be loaded
(or swapped and reloaded) and executed anywhere within main storage. References
to shared segments of both user code and system libraries are resolved during execution.
A batch stream which enters the system is first processed by the symbiont
complex. This complex disassociates the run stream from the relatively slow unit-
record device speeds and allows tasks to proceed at higher mass-storage speeds.
The run stream is scanned for facility allocation and prescheduling. Multiple
asynchronous input-output services are allowed. This is particularly important
in a multiprogramming-multiprocessing environment.
scheduler. We have only sketched the EXEC operations here. Interested readers
should check with the Univac EXEC manuals for details.
Figure 9.33 A typical Tandem system with four processors: dual Dynabus with a Dynabus control in each processor module, CPU, memory, I/O channels, and device controllers. (Courtesy of Tandem Computers, Inc., 1978.)
Figure 9.34 System components in one processor of the Tandem/16: the X and Y interprocessor buses, and the processor module with its CPU, interprocessor control, memory and memory control, and I/O channel. (Courtesy of Tandem Computers, Inc., 1978.)
interprocessor bus controllers. There are two sets of radial connections from each
bus controller to each processor module. They distribute clocks for synchronous
transmission over the bus. No failed processor can dominate the dynabus utiliza-
tion. Each bus has a clock associated with it, running independently of the pro-
cessor clocks.
The dynabus interface controller consists of three high-speed caches, two of which are associated with the two buses and one of which is an output queue that can be switched between the two buses. Each cache holds 16 words, and all bus transfers are made from cache to cache. All components attaching to the buses are kept physically distinct, so that no single component failure can contaminate both buses simultaneously. Also, the controller has clock synchronization and interlock circuitry. All processors communicate in a point-to-point manner using this shared bus configuration.
For any interprocessor data transfer, one processor is the sender and the other
is the receiver. Before a processor can receive data over a bus, the operating system
must configure an entry in a table known as the bus receive table (BRT). Each
BRT entry contains the address where the incoming data is to be stored, the
sequence number of the next packet, the send processor number, the receiver
processor number, and the number of words to be transferred. A SEND instruction
is executed in the sending processor, which specifies the bus to be used, the intended
receiver, and the number of words to be sent.
Figure 9.35 Tandem/16 dynabus interface for interprocessor data transfer. (Courtesy of Tandem Computers, Inc., 1978.)
The sending processor's CPU stays
in the SEND instruction until the data transfer is completed. Up to 65,535 words
can be sent in a single SEND instruction. While the sending processor is executing
the SEND instruction, the dynabus control logic in the receiving processor is
storing the data away according to the appropriate BRT entry. In the receiving
processor, this occurs simultaneously with its program execution.
A message is divided into packets of 16 words. The sending processor fills its
outgoing queue with these packets, requests a bus transfer, and transmits upon
grant of the bus by the bus controller. The receiving processor fills the incoming
queue associated with the bus and issues a microinterrupt to its own CPU. The
CPU checks the BRT entry accordingly. The BRT entries are four words that
include a transfer buffer address, a sequence number, and the sender and receiver
processor numbers. Error recovery action is to be taken in case the transfer is
not completed within a time-out interval. These parameters are placed on a register
stack and are dynamically updated so that the SEND instruction is interruptible
on packet boundaries.
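A small derived figure (illustrative arithmetic only): since a message is divided into 16-word packets, the 65,535-word maximum of a single SEND corresponds to roughly 65,535/16, or about 4096, packets, each of which is an interruptible boundary for the SEND instruction.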
All I/O is done on a direct memory access basis through a microprogrammed, multiplexed channel with a block size determined by the individual controller. All the controllers are buffered so that all transfers over the I/O channel are at memory speed (4M bytes/s) and never wait for mechanical motion, since the transfers always come from a buffer in the controller rather than from the actual I/O device. There is an I/O control table (IOC) in the system data space of each processor that contains a two-word entry for each of the 256 possible I/O devices attached to the I/O channel. These entries contain a byte count and virtual address in the system data space for I/O data transfers. The I/O channel moves the IOC entry to active registers during the connection of an I/O controller and restores the updated values to the IOC upon disconnection. The channel transfers data in parallel with program execution. The memory system priority always permits I/O accesses to be handled before CPU or dynabus accesses.
The dual-ported I/O device controllers provide the interface between the I/O
channel and a variety of peripheral devices. Each controller contains two in-
dependent I/O channel ports so that it can never simultaneously cause failure of both ports. Each port attached to an I/O channel must be assigned a controller number and a priority distinct from other ports attached to the same I/O channel. Logically only one of the two ports of an I/O controller is active; the other port is utilized only in the event of a path failure to the primary port. If a processor determines that a given controller is malfunctioning on its I/O channel,
it can issue a command that logically disconnects the port from the controller.
This does not affect the ownership status. If the problem is within the port, an
alternate path can be used.
Each disk drive in the system may be dual-ported. Each port of a disk drive is
connected to an independent disk controller. Each of the disk controllers is also dual-ported and connected between two processors. A string of up to eight drives (four mirrored pairs) can be supported by a pair of controllers in this manner.
The disk controller is buffered and absolutely immune to overruns. All data
transferred over the bus is parity checked in both directions, and errors are
reported via the interrupt system. A watchdog timer in the I/O channel detects if a nonexistent I/O controller has been addressed, or if a controller stops responding during an I/O sequence. In case of channel failure, the path switching between devices and controller is demonstrated in Figure 9.36, where an alternate path is chosen to provide access to the disk.
The operating system of Tandem is called the Guardian. It is a “nonstop”
operating system designed to achieve the following capabilities:
1. It should be able to remain operational after any single detected module or bus
failure,
2. It should allow any module or bus to be repaired on-line and then reintegrated
into the system.
3. It is to be implemented with high reliability provided by the hardware but not
negated by software problems.
4. It should support all possible hardware configurations, ranging from a two-
processor, diskless system through a sixteen-processor system with billions of
bytes of disk storage.
5. It should hide the physical configuration as much as possible so that applications
could be written to run on a great variety of system configurations.
The Guardian resides in each processor but is aware of all other processors.
In fact, the operating systems in different processors constantly monitor each
other's performance. The instant one processor's operating system fails to respond
correctly, other processors assume that it is failing and take over its work load.
Obviously, this requires a great deal of communication among the processors. This
requires a process to be able to address the system resources by a logical name
rather than by a physical address. The Guardian operating system is designed in a
top down manner with three levels of well-defined interfaces, as shown in Figure
9.37. It is based on the concept of processes sending messages to other processes.
All resources in the system are considered to be files, and each resource has a
logical file name. Communication between an application process and any
resource {disk, tape, another process, etc.) is via the file system. The file system
knows only the logical name of the intended recipient of a message. It passes the
message to the message system, which then determines the physical location of the
recipient. The message system is a software analog of the dynabus. It handles
automatic path retries in case of path errors. Because application programs deal
only with logical file names, the system offers total geographic independence of
resources. The programmer views this multiprocessor system as a single processor
with resources available through file system calls.
The processes and messages are further elaborated with abstraction. Each
processor module has one or more processes residing in it. A process is initially
created in a specific processor and may not execute in another processor. Each
process has an execution priority assigned to it. Processor time is allocated to the
Figure 9.36 Alternate path switch on I/O controller failure: mirrored disk volumes reached through dual-ported controllers, with mirrored and unmirrored disk processes checkpointing between CPUs. (Courtesy of Tandem Computers, Inc., 1978.)
[Figure: Message-system primitives between a requestor process and a server process: LINK, the message copied to the server, and the result copied back to the requestor.]
LINK has been successfully completed, both processes are assured that sufficient
resources are in hand to complete the message transfer. Furthermore, a process
may reserve some control blocks to guarantee that it will always be able to send
messages to respond to a request from its message queue. Such resource control assures that deadlocks are not caused by complex producer-consumer interactions.
The Guardian is constructed of processes which communicate with messages.
Fault tolerance is provided by duplication of both hardware and software com-
ponents. Access to I/O devices is provided by process pairs consisting of a primary
process and a backup process. The primary process must check out state informa-
tion to the backup process so that the backup may take over on a failure. Requests
to these devices are routed using the logical device name or number so that the
request is always routed to the primary process. The result is a set of primitives
and protocols which allow recovery and continued processing in spite of single
failures in bus, processor, or I/O device.
A “network” system can link up to 255 Tandem-16 systems. The Guardian-
Expand Network system can extend the dynabus into a long-range network. To a
user at a terminal, the entire network appears to be a single Tandem-16 system.
The network maintains the geographic independence of resources. Any resource
in the network can be addressed by its logical file name without regard for its
physical location. However, a configuration option allows users to reserve pro-
cessors for local processing requirements, thereby excluding those processors from
the network.
Figure 9.39 Cray X-MP overall system organization: central memory shared by the two CPUs, communication and control between them, CPU I/O, the SSD, the IOS, and mass storage devices. (Courtesy of Cray Research, Inc., 1983.)
processors. Each CPU has an internal structure very similar to Cray-1. However,
there are three ports per CPU (instead of two ports in Cray-1). The extra port is
added to allow communication between the two CPUs via a communication and
control unit.
Shared central memory The two processors share a central bipolar memory with 4M 64-bit words. This shared memory is organized in 32-way interleaved memory banks (twice that of the Cray-1). All banks can be accessed independently and in parallel during each machine clock period. Each processor has four parallel memory ports (four times that of the Cray-1) connected to this central memory: two for vector fetches, one for vector store, and one for independent I/O operations. The multiport memory has built-in conflict resolution hardware to minimize the delay and maintain the integrity of all memory references to the same bank at the same time, from all processors' ports. The interleaved multiport memory design coupled with a shorter memory cycle time provides a high-performance and balanced memory organization with sufficient bandwidth (eight times that of the Cray-1) to support simultaneous high-speed CPU and I/O operations.
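One rough way to read these factors (an illustrative interpretation, not a statement from Cray): with four ports per CPU and two CPUs, up to 2 x 4 = 8 memory references can proceed per clock period, against a single port on the Cray-1, which by itself accounts for the stated eightfold bandwidth even before the shorter bank cycle time is considered.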
The CPU of X-MP Throughout the two CPUs, 16-gate array integrated circuits
are used. These circuits, which are faster and denser than the circuitry used in the
Cray-1, contributed to a clock cycle time of 9.5 ns and a memory bank cycle time of
38 ns. Proven cooling and packaging techniques have also been used on the
Cray X-MP to ensure high system reliability.
Each CPU is basically a Cray-1 processor with additional features to permit
multiprocessing. Within each CPU are four instruction buffers, each with 128
16-bit instruction parcels, twice the capacity of the Cray-1 instruction buffers.
The instruction buffers of each CPU are loaded from memory at the burst rate of
8 words per clock period. The contents of the exchange package have been aug-
mented to include cluster number and processor number. Increased protection of
data is also made possible through a separate memory field for user programs and
data. Exchange sequences occur at the rate of 2 words per clock period on the
X-MP.
Operational registers and functional pipelines are among the features pro-
viding compatibility with the Cray-1. There are 13 functional pipes and A, B, S, T,
and V registers as in Cray-1. With a basic machine cycle of 9.5 ns, the X-MP is
capable of an overall instruction issue rate of over 200 MIPS. Computation rates
of over 400 megaflops are possible, and combined arithmetic/logical operations
can exceed 1000 million operations per second.
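These rates follow from the clock period (an illustrative reading of the quoted figures, not an additional specification): 1/9.5 ns is about 105 million clock periods per second, so one instruction parcel issued per clock in each of the two CPUs gives roughly 210 MIPS, and one add plus one multiply result per clock per CPU when the pipelines are chained gives about 2/9.5 ns, or roughly 210 megaflops per CPU and 420 megaflops for the pair.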
both, either, or none of the processors. The cluster may be accessed by any pro-
cessor to which it is allocated in either user or system mode. A 64-bit real-time clock
is shared by the processors.
Solid-state storage device (SSD) A new and large CPU-driven solid-state storage
device (SSD) is designed as an integral part of the mainframe with very high block
transfer rate. This can be used as a fast-access device for large prestaged or inter-
mediate files generated and manipulated repetitively by user programs, or used by
the system for job “swapping” space and temporary storage of system programs.
The SSD design with its large size (32 megawords), typical rate of 1000 megabytes/s
(250 times faster than disk), and much shorter access time (less than 0.5 ms, 100
times faster than disk), coupled with the high-performance multiprocessor design,
will enable the user to explore new application algorithms for solving bigger and
more sophisticated problems in science and engineering. The concept of SSD is
illustrated in Figure 9.40. It performs much better than the disk due to its shorter access time and faster transfer rate.
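To make the scale concrete (illustrative arithmetic based only on the figures above): 32 megawords of 64-bit words is 32M x 8 = 256 megabytes, so at the typical 1000 megabytes/s the entire SSD can be streamed in roughly a quarter of a second, whereas a disk at the implied 4 megabytes/s (1000/250) would need on the order of a minute.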
I/O subsystem (IOS) The I/O subsystem, which is an integral part of the X-MP
system, also contributes to the system’s overall performance. The I/O subsystem
(compatible with Cray-1/2) offers parallel streaming of disk drives, I/O buffering
(8 megawords maximum size) for disk-resident and buffer memory-resident
datasets, high-performance on-line tape handling, and a common device for front-end system communication, networking, or specialized data acquisition.
The IOP design enables faster and more efficient asynchronous I/O operations for
data access and deposition of initial and final outputs through high-speed channels
(each channel has a maximum rate of 850 megabytes/s, and a typical rate of
40 megabytes/s, 10 times faster than disk), while relieving the CPUs to perform
computation-intensive operations.
Figure 9.40 The concept of the SSD in the Cray X-MP as compared with the use of disks.
Figure 9.41 The data flow and transfer rates in the Cray X-MP. (Courtesy of Cray Research, Inc., 1983.)
interfaces compensate for differences in channel widths, word size, logic levels, and
control protocols, and are available for a variety of front-end systems. The X-MP can be connected to front-end machines running systems such as IBM MVS, CDC NOS, and NOS/BE.
The data flow patterns and data transfer rates among the mainframe (two
CPUs and main memory of 4 megawords), the SSD, the IOS, the external disks
and tapes, and the front-end system are illustrated in Figure 9.41. The high-speed
transfer rates between the mainframe and SSD (1000 megabytes/s) and between
the mainframe and IOS (80 megawords/s) make the system suitable for solving
large-scale scientific problems, which are both computation-intensive and I/O
demanding.
of task initiation for multitasking. Typically O(1 us) to O(1 ms) time is needed, depending on the granularity of the tasks and the software implementation techniques.
The two processors can assume various flexible architectural clustering
patterns. Faster exchange for switching machine state between tasks is provided.
Hardware supports the separation of memory segments for each user’s data and
program to facilitate concurrent programming.
Let p = 2 be the number of physical processors in the system. Listed below are various processor clustering patterns that are supported in the design of the X-MP. The Cray Operating System (COS) has been designed to control the allocation of the clusters of shared registers to the CPUs in either user or supervisor mode.
4. Each cluster contains a unique set of shared data and synchronization registers for the intercommunication of all processors in a cluster.
5. Each processor in a cluster can run in either monitor or user mode controlled
by the operating system.
6. Each processor in a cluster can asynchronously perform either scalar or vector
operations dictated by user programs.
7, Any processor running in monitor mode can interrupt any other processor and
cause it to switch from user mode to monitor mode.
8. Detection of system deadlock is provided within the cluster.
Example 9.4 Vector computations on the X-MP are illustrated in Figure 9.42 for the following computation:
A = B + s * D
where boldface letters denote vector quantities and s is a scalar. Using three memory ports per processor, the hardware automatically "chains" through all five vector operations such that one result per clock period can be delivered.
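In scalar Fortran this computation is just the loop below (a sketch to make the five operations explicit; the subroutine name is hypothetical). On the X-MP the two fetches, the multiply, the add, and the store are chained so that one A(I) is produced per clock period.

      SUBROUTINE BPSD (A, B, D, S, N)
      DIMENSION A(N), B(N), D(N)
C     A = B + s*D: fetch B(I), fetch D(I), multiply by the scalar S,
C     add, and store A(I).
      DO 1 I = 1, N
    1 A(I) = B(I) + S*D(I)
      RETURN
      END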
[Figure 9.42: Chaining of A = B + s*D. On the Cray-1 the five vector operations (fetch B, fetch D, multiply, add, store A) require three separate chains; on the Cray X-MP they proceed as a single chain.]
When this book was published, software support for multitasking at the job level (Figure 9.43a), the program level (Figure 9.44), and the loop level (Figure 9.45) was available from Cray Research. However, the feasibility of implementing multitasking at the job-step level (Figure 9.43b) was still under further study. In what follows, we show three examples to illustrate the concept of multitasking, in particular for the X-MP multiprocessor system.
[Figure 9.43: Multitasking at (a) the job level, with Job 1 on CPU-0 and Job 2 on CPU-1, and (b) the job-step level, with independent steps of one job (compile A, compile B; load A, load B) on the two CPUs.]
[Figure 9.44: Multitasking at the program level: a main program with subroutines Sub-A through Sub-D distributed between CPU-0 and CPU-1.]
[Figure 9.45: Multitasking at the loop level: the loop DO 1 I = 1, N is split into DO 1 I = 1, N, 2 on CPU-0 and DO 1 I = 2, N, 2 on CPU-1.]
      DO 2 I = 1, M
      DO 1 J = 1, N
    1 A(I,J) = B(I,J) + C(I,J)
    2 CONTINUE
(Vector code)

      DO 2 I = 1, M
      DO 1 J = 1, N
    1 A(I,J) = A(I,J-1)*A(I,J)
    2 CONTINUE
(Scalar code)
Example 9.6 Multitasking by processor pipelining (macropipelining, introduced in Chapter 1) is illustrated in Figure 9.47 for the following loop computations, where S1 and S2 stand for the two vector computations involved.
      DO 1 I = 2, N
      A(I) = A(I-1) + B(I)          (S1)
      D(I) = A(I) + C(I)            (S2)
    1 CONTINUE
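One way to read the decomposition in Figure 9.47 (a sketch under the assumption that CPU-0 is given S1 and CPU-1 is given S2; the subroutine names are hypothetical and the synchronization calls of the Cray multitasking library are omitted): S1 is a recurrence and cannot be split across processors, but S2 only consumes A(I), so CPU-1 can compute S2 for iteration I while CPU-0 is already producing A(I+1).

C     Task assigned to CPU-0: the recurrence S1.
      SUBROUTINE S1TASK (A, B, N)
      DIMENSION A(N), B(N)
      DO 1 I = 2, N
    1 A(I) = A(I-1) + B(I)
      RETURN
      END
C     Task assigned to CPU-1: S2, running one iteration behind CPU-0
C     and consuming each A(I) as it becomes available.
      SUBROUTINE S2TASK (A, C, D, N)
      DIMENSION A(N), C(N), D(N)
      DO 2 I = 2, N
    2 D(I) = A(I) + C(I)
      RETURN
      END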
Figure 9.46 Multitasking in Example 9.5 among two CPUs in Cray X-MP.
vectorized, and the vector length is becoming longer, an even better performance
can be achieved.
With a 9.5-ns clock, the peak speed of one CPU in the X-MP is 210 megaflops and that of the two CPUs (dedicated to multitasking of a single large job) is 420 megaflops. We assume one unit to be based on compiler-generated code running on the Cray 1/S. This means that the "minimum" 1-CPU rate is 1 and that of 2 CPUs is 2. Let "typical" be the cases of small-to-medium size vectors encountered in typical programs. Let "maximum" refer to the cases of very long vectors. The performance of various types of programs relative to that of the Cray 1/S is summarized in Table 9.4. The unit for I/O rate is based on measured time per sector.
Figure 9.47 Multitasking by processor pipelining in Example 9.6.
Table 9.4 Performance relative to the Cray 1/S

Operation                 Minimum   Typical   Maximum
A = B                       1.1       1.8       2.1
A = B + C + D               1.5       2.7       2.9
A = B + C*D                 1.4       2.9       3.6
A = B + s*D                 1.3       3.0       4.0
A = B + C + D + E           1.3       2.3       2.7
A = B + C + D*E             1.6       2.5       2.9
A = B*C + D*E               1.3       2.5       3.0
A = B + C*D + E*F           1.5       2.1       2.2
Typical                     1.5       2.5       3.0
Theoretically, we can roughly analyze the total speedup of the X-MP over a scalar processor as follows: Vectorization offers a speedup of 10 to 20 over scalar processing, depending on the actual code and vector length. Multitasking offers an additional speedup of S_m <= 2, depending on task size and relative multitasking overhead. The total speedup over scalar processing is thus equal to S = S_m x (10 to 20). In the benchmark SPECTRAL, for short-term weather forecasting, the actual 2-CPU speedup has been measured as S_m = 1.89 over 1 CPU. Therefore, we obtain S = 18 to 38 under the assumption that S_m = 1.8 to 1.9.
We describe below a model developed by Ingrid Bucher (1983) to evaluate the
performance of Cray X-MP in environments with different work loads. Work
loads on the X-MP can be characterized by three types of execution requirements:
scalar mode, vector mode, and concurrent mode. The scalar mode is characterized
by a process code being executed sequentially either for reasons of logic or because
it is too costly to vectorize. In the vector mode, a process code is vectorizable and
executed in the vector section of the processor. In this mode the process granularity
is small. In the concurrent mode, the process code is decomposed into cooperating
processes and can be executed on more than one processor. The process granularity
is large enough to overcome the communication and process creation overheads.
Let S_s, S_v, and S_c represent the rates at which a machine can execute scalar, vector, and concurrent codes, respectively. Also, let F_s, F_v, and F_c be the fractions of the work load that can be executed only in scalar, vector, and concurrent modes, respectively. Therefore the time required to execute this work load is proportional to

1/S_eff = F_s/S_s + F_v/S_v + F_c/S_c        (9.1)
where S_eff is the work load-dependent effective speed of the machine. Equation 9.1 implies that the slowest of the execution speeds will critically influence the effective speed unless the fractional work load F associated with it is negligibly small.
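As a worked illustration of Eq. 9.1 (the work-load fractions below are hypothetical, chosen only for this example; the speeds are the single-processor Cray X-MP values of Table 9.8): with F_s = 0.2, F_v = 0.6, F_c = 0.2 and S_s = 5.4, S_v = 79, S_c = 5.4,

1/S_eff = 0.2/5.4 + 0.6/79 + 0.2/5.4 = 0.0817,   so   S_eff = 12.2 megaflops (approximately).

The two scalar-speed terms contribute about 90 percent of the total even though they cover only 40 percent of the work, which is exactly the bottleneck effect described above.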
The weight factors F,, F,, and F. have to be determined empirically from the
work load and adjusted by projections of how the work load will evolve in the
future. In general, the choice of machine configurations may influence the charac-
teristics of the work load. For example, a machine with high concurrent speed
may encourage more Monte Carlo simulations. A satisfactory measure that is
independent of machine architecture and compiler optimization and characterizes
the amount of computational work done is desirable. A generally accepted metric for the execution speed of supercomputers is the megaflops. However, this metric does not include much of the work done. Examples of such work are the logical operations, integer arithmetic, and table lookups. Because of the highly parallel
architecture of each processing unit, these operations may often be performed in
parallel with the floating-point work. Nevertheless, we adopt the megaflops
measure.
In the vector mode, the time required to perform operations on vectors of length N is a linear function of N given by

T = T_start + N*A        (9.2)

where T_start is the startup time for the vector operation and A is the time per result element. For example, the Cray X-MP has shorter startup and element times for its vector operations than the Cray-1S and therefore clearly has the superior vector processor. The vector execution speed can be estimated by

S_v = N/(T_start + N*A)        (9.3)

Note, however, that the average time to process a vector of length N has a more complicated relationship than Eq. 9.2, because vectors of length N > 64 are stripmined in sections of length 64.
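As a numerical illustration of Eqs. 9.2 and 9.3 (the startup and element times here are hypothetical, not measured values): taking T_start = 500 ns and A = 9.5 ns, a vector of length N = 64 takes T = 500 + 64 x 9.5 = 1108 ns, so S_v = 64/1108 ns, or about 58 million results per second, while for very long vectors S_v approaches 1/A, about 105 million results per second. The startup term is what penalizes short vectors.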
Not all vector operations follow the simple relationship shown in Eq. 9.2. For the Cray-1S, vectors must have a constant stride (distance between memory locations), but the stride need not be 1. However, for many repetitive operations, data are not stored in memory in a regular pattern. Hence we can identify these types of vector operations:
Operands stored in irregular locations must be gathered, and the results may have to be scattered back into memory. The Cray-1S performs gathers and scatters in
scalar mode only. Therefore, we can modify the execution rate in the vector mode by

1/S_v = F_1/S_v1 + F_2/S_v2 + F_3/S_v3        (9.4)

where S_v1, S_v2, and S_v3 are the vector speeds for operands stored in contiguous memory locations, locations with constant stride > 1, and random locations, respectively. F_1, F_2, and F_3 are the corresponding fractions of the vectorizable work load.
The number of loads and stores per floating-point operation greatly influences the vector speed. For some typical vector codes, measurements indicate that 0.6 to 1.0 loads and 0.2 to 0.5 stores were observed per floating-point operation. Therefore, the following Fortran statement

V1(I) = S1*V2(I) + S2*V3(I)        (9.5)

produces the typical code. Further, two such statements are contained in a typical DO loop, reducing the startup time of the loop per floating-point operation by a factor of 2 for the X-MP.
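A sketch of such a "typical" loop body is shown below (the array and scalar names are hypothetical; the point is only that two statements of the form of Eq. 9.5 share one loop startup).

      SUBROUTINE TYPLP (V1, V2, V3, V4, V5, V6, S1, S2, S3, S4, N)
      DIMENSION V1(N), V2(N), V3(N), V4(N), V5(N), V6(N)
C     Two statements of the Eq. 9.5 form in one DO loop, so the
C     vector startup cost is amortized over twice the work.
      DO 1 I = 1, N
      V1(I) = S1*V2(I) + S2*V3(I)
      V4(I) = S3*V5(I) + S4*V6(I)
    1 CONTINUE
      RETURN
      END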
Table 9.6 contains values for vector speeds for a work load similar to that at the Los Alamos Laboratory. Table 9.7 indicates the values for a more ideal work load. Note that the effective speeds are degraded by only small amounts due to the slow components. Measurement of the scalar speed S_s is performed by running a benchmark program. For example, the scalar speeds for the Cray-1S and one of the two processors of the Cray X-MP are 4.2 and 5.4, respectively.
For the asynchronous concurrent mode, there are communication overheads and portions of the parallel algorithm that must be executed sequentially. Let F_p be the fraction of the code that can be executed in parallel mode on an arbitrary number P of processors, with the remainder F'_s of it to be executed sequentially. Then the execution time of the concurrent algorithm is proportional to

1/S_c = F'_s/S'_s + F_p/(P*S_p) + T_comm        (9.6)
Table 9.6 Vector speeds for a Los Alamos-like work load

                             S_v1           S_v2           S_v3          S_v
                             (contiguous    (constant      (random       (effective
Machine                      vectors)       stride)        access)       vector speed)
Cray-1S                         38             56             5              46
Cray X-MP (1 processor)        107            101             6              79

Table 9.7 Vector speeds for a more ideal work load
Cray-1S                         66             65             3              66
Cray X-MP (1 processor)        133            132             6             133
where S'_s is the speed for executing the sequential portion of the code and T_comm is the communication overhead (S_p in Eq. 9.6 denotes the speed of a single processor on the parallel portion). Using Eq. 9.1 and the results described earlier, we can compile characteristic speeds of the Cray-1S, the Cray X-MP, and some hypothetical machines for two work loads with differing characteristics. In Tables 9.8 and 9.9, the machines in quotes are hypothetical machines and the numbers in parentheses are postulated numbers for hypothetical machines.
From these tables, it is obvious that the effective speed of a supercomputer is strongly work load-dependent. The slowest characteristic speed will affect the effective speed critically unless the fraction of the work load associated with that speed is negligibly small. It acts like a bottleneck. The most effective way to speed up a machine is to increase this speed or to decrease the fraction of work associated with it. The results also show that speeding up the fastest characteristic speed of a supercomputer will markedly improve its effective speed only if the fraction of the work load running at that speed is close to 1. If this is not the case, the installation of additional vector pipelines on a vector computer will not be effective.
In summary, the Cray X-MP has 8 times the Cray-1 memory bandwidth with guaranteed chaining of linked vector operations. Compared to the Cray-1, the X-MP offers 1.25 to 3.75 speedup for a single job and 2.5 to 5 times the throughput on a CPU-
Table 9.8 Characteristic speeds for the Los Alamos-like work load

Machine                       S_s     S_v      S_c      S_eff
Cray-1S                       4.2      46      4.2       9.2
Cray X-MP (1 processor)       5.4      79      5.4      12.2
Cray X-MP (2 processors)      5.4    (158)   (10.8)    (16.8)
"Cray X-MP" (4 processors)    5.4    (316)   (21.6)    (20.7)
Table 9.9 Characteristic speeds for the more ideal work load

Machine                       S_s     S_v      S_c      S_eff
Cray-1S                       4.2      66      4.2      16.7
Cray X-MP (1 processor)       5.4     133      5.4      23.2
Cray X-MP (2 processors)      5.4    (266)   (10.8)    (32.5)
"Cray X-MP" (4 processors)    5.4    (532)   (21.6)    (40.6)
dominated job mix. The improvement in speed is due to the two processors scheduled by the COS, a shorter clock period, higher memory bandwidth, and guaranteed chaining. Like the Cray-1, the X-MP is good for both short and long vector processing. A general guideline for exploiting the computing power of the X-MP is to partition tasks at the highest level to apply multitasking and then to vectorize tasks at the lower levels as much as possible.
The Cray-2 Cray Research, Inc. is currently developing the Cray-2, which is expected to have 6 times the speedup in scalar and 12 times the speedup in vector operations over the Cray-1. The Cray-2 is planned to have four processors using 32M words of main memory. The CPU cycle time is targeted to be 4 ns. The I/O will be improved 20 times over the current Cray-1 capability. It has been suggested that 16-gate ECL chips will be used in the Cray-2. Highly dense logic and memory modules will be cooled by immersion in inert fluorocarbon liquid. The longest wire length is confined to 16 in. The system is planned to be housed in a circular frame 38 inches in diameter and 26 inches in height, a rather condensed size for a supercomputer. The Cray-2 is expected to become available in the late 1980s.
The original design of the C.mmp architecture is described in Wulf et al. (1972). The final C.mmp architecture is described in Fuller and Harbison (1978). The experience with the C.mmp, among other multiprocessors, is presented in Jones and Schwartz (1980). Reliability of the C.mmp has been studied in Siewiorek et al. (1978). The Hydra operating system is described in Wulf et al. (1981). Oleinick (1978) studied parallel algorithms for the C.mmp. Marathe and Fuller (1977) evaluated the C.mmp architecture and the Hydra kernel.
The S-1 multiprocessor has been reported in Widdoes (1980). Lawrence Livermore National Laboratory has published a series of reports on the S-1 project (Lawrence Livermore National Laboratory, 1981). IBM System/370 architecture was assessed in Case and Padegs (1978). The IBM 3033 and 3081 are
EXAMPLE MULTIPROCESSOR SYSTEMS 729
described in IBM (1978, 1983). Programming and operating system design considerations of tightly coupled IBM multiprocessor systems are given in Arnold et al. (1974), which includes an overview of OS/VS2 MVS. Functional characteristics of the 370/168 can be found in the IBM manual (IBM, 1979). The system description of the Denelcor HEP computer is extracted from technical notes by Denelcor, Inc. (1983) and a report by Smith and Fink (1980). Jordan (1983) has studied the performance of the HEP. The material on the Cray X-MP and Cray-2 is based on Chen (1983) and some technical presentations by Cray Research, Inc. Bucher (1983) developed the performance models for the Cray X-MP.
The evolution of the Sperry Univac 1100 series is reported in Borgerson et al. (1978). The detailed description of the Univac 1100/80 systems can be found in several Sperry Univac manuals on the processor and storage, hardware system, programming, and executive systems. An introductory treatment of commercial multiprocessor systems, including the IBM 370/168, CDC Cyber-170, Honeywell Series 60 Level 66, Univac 1100/80, Burroughs B7700, DEC System 10 Model KL-10, C.mmp, and Cm*, is given in Satyanarayanan (1980), plus an annotated bibliography up to 1979. Surveys of earlier multiprocessors can be found in Enslow (1974, 1977). A recent tutorial text on supercomputers by Hwang and Kuhn (1984) covers recent vector processors and multiprocessor systems. The Tandem Nonstop system is described in Katzman (1978). Recently, a number of research multiprocessors have been reported by Gajski et al. (1983), Gottlieb et al. (1983), and Fritsch et al. (1983).
Problems
9.1 Determine the evaluation time of the arithmetic expression
among all the possible configurations in terms of hardware requirements and expected network performance.
9.3 Consider the storage of a symmetric n-by-n matrix A = (a_ij) in a memory system with n parallel modules, where n is a perfect square. Devise storage schemes to satisfy each of the following requirements. It is assumed that only about half of the matrix needs to be stored: ⌈n/2⌉ rows if n is odd, and n/2 + 1 rows if n is even. Each memory module has a separate index register to keep track of the allocation.
(a) It is required to access any row or any column in one memory cycle. For example, we want to access all the elements of one row in one cycle, or all the elements of one column in another cycle.
(b) It is required to access any row or any nonoverlapping square block of elements as in the example below:
(c) Illustrate your memory allocation schemes for (a) and (b) with n = 9 and n = 16, respectively. Is it possible to achieve (a) and (b) with the same scheme?
9.4 Answer the following questions associated with the C.mmp system:
(a) Explain the special system instructions HALT, RESET, WAIT, RTI, and RTT developed for the PDP-11 processors in the C.mmp.
(b) Explain the function of the Dmap and of the interprocessor bus installed in the C.mmp.
(c) What are the special features built into the Hydra operating system?
9.5 Answer the following questions on the S-1 multiprocessor project:
(a) Explain the use of separate data cache and instruction cache in the pipelined design of the S-1 Mark IIA uniprocessors.
(b) Explain the virtual-to-physical address translation scheme used in the S-1 system.
(c) What are the special features in the Amber operating system that facilitate multiprocessing?
9.6 Answer the following questions for the HEP multiprocessor:
(a) Distinguish between the conventional SISD pipelining and the MIMD pipelining introduced in the HEP.
(b) Explain the design and priority operations of the packet-switching interconnection network developed in the HEP.
(c) Explain the synchronization and protection mechanisms developed in the HEP.
9.7 Answer the following questions for the IBM multiprocessors:
(a) Distinguish the Attached Processing (AP) mode from the Multiprocessing (MP) mode in various IBM multiprocessors.
(b) What are the improvements of the IBM 3081 over the IBM 370/168 MP and IBM 3033 in both technologies and designs?
(c) What are the multiprocessing features in the IBM OS/VS2 operating system?
9.8 Answer the following questions for the Univac 1100 series:
(a) Describe the increase of multiprocessing capability in various M × N configurations of the Univac 1100/80.
(b) What are the improvements made in the Univac 1100/90 multiprocessors over the Univac 1100/80 models?
(c) Explain the functional structure of the kernel of the EXEC operating system.
9.9 Answer the following questions for the Tandem NonStop system:
(a) Why can the Tandem multiprocessor tolerate all single failures in the system?
(b) Explain the alternate path switching between devices and controllers in the Tandem/16.
(c) Describe the message system developed in the Guardian operating system.
9.10 Answer the following questions for the Cray X-MP system:
(a) Explain the inter-CPU communication structure in the Cray X-MP.
(b) Explain the functions of the Solid-State Storage Device (SSD) and of the I/O Subsystem (IOS) in the Cray X-MP.
(c) Explain the improvements made in the Cray X-MP over its predecessor, the Cray-1.
9.11 A computing system has 10 tape drives available. All jobs that run on the system require a maximum of four tape drives to complete, but we know that they start by running for a long period with only three; they request the one remaining drive for a short period when needed near the end of their operation. (There is an endless supply of these jobs.)
(a) If the job scheduler operates with the policy that it will not start a job unless there are four unassigned drives, and it assigns those four drives to a job for its entire duration, what is the maximum number of jobs that can be in progress at once? What are the minimum and maximum numbers of drives that may actually be idle as a result of this policy?
(b) Figure out a better scheduling policy to improve the drive utilization rate and at the same time avoid a system deadlock. What is the maximum number of jobs that can be in progress under your new policy? What are the minimum and maximum numbers of drives that may be idle as a result of this policy?
9.12 Chained vector time (chime) is a useful term for discussing vector operation timing. The Cray X-MP can combine several chimes implementable on a Cray-1 into a single long chime. Give an example vector computation sequence (more involved than that shown in Example 9.4) to show the advantages of using the X-MP over the Cray-1.
CHAPTER
TEN
DATA FLOW COMPUTERS AND VLSI
COMPUTATIONS
Computer architects have been constantly searching for new approaches to designing high-performance machines. Data flow and VLSI offer two mutually supportive approaches toward the design of future supercomputers. In this chapter, we study the requirements of data-driven computations, functional programming languages, and the various data flow system architectures that have been proposed in recent years. In the VLSI computing area, we introduce topological structures of multiprocessor arrays for large-scale numeric computations and for symbolic manipulations. Techniques for directly mapping parallel algorithms into hardware structures will be studied. VLSI architectures for designing large-scale matrix arithmetic solvers are presented based on matrix partitioning and algorithmic decomposition. Potential applications of some of these VLSI computing structures are demonstrated for real-time image processing.
Data flow computers are based on the concept of data-driven computation, which
is drastically different from the operation of a conventional von Neumann ma-
chine. The fundamental difference is that instruction execution in a conventional
computer is under program-flow control, whereas that in a data flow computer is
driven by the data (operand) availability. We characterize below these two types
of computers. The basic structures of data flow computers and their advantages
and shortcomings will be discussed in subsequent sections.
Jack Dennis (1979) of MIT has identified three basic issues towards the
development of an ideal architecture for future computers. The first is to achieve
a high performance/cost ratio; the second is to match the ratio with technological
progress; and the third is to offer better programmability in application areas.
The data flow model offers an approach to meet these demands. The recent
progress in the VLSI microelectronic area has provided the technological basis
for developing data flow computers.
(Figure: comparison of conventional control flow and data flow program execution; recoverable labels include program memory, data memory, fork, and join.)
This data-driven concept means asynchrony: many instructions can be executed simultaneously and asynchronously. A high degree of implicit parallelism is expected in a data flow computer. Because there is no use of shared memory cells, data flow programs are free from side effects. In other words, a data flow operation is purely functional and produces no side effects such as changes to a memory word. Operands are passed directly as "tokens" of values instead of as "address" variables. Data flow computations have no far-reaching effects. This locality of effect, plus asynchrony and functionality, makes them suitable for distributed implementation.
Information items in a data flow computer appear as operation packets and data tokens. An operation packet is composed of the opcode, operands, and destinations of its successor instructions, as shown in Figure 10.2. A data token is formed with a result value and its destinations. Many of these packets or tokens are passed among various resource sections in a data flow machine. Therefore, the machine can assume a packet communication architecture, which is a type of distributed multiprocessor organization.
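To make the two packet formats concrete, the short sketch below models an operation packet and a data token as simple records. The field names and the example values are illustrative assumptions for this sketch only, not the exact formats of any particular data flow machine.

from dataclasses import dataclass
from typing import List, Any

@dataclass
class DataToken:
    value: Any               # the result value being carried
    destinations: List[int]  # addresses of the consumer instructions

@dataclass
class OperationPacket:
    opcode: str              # operation to perform, e.g., "+" or "*"
    operands: List[Any]      # operand values already gathered
    destinations: List[int]  # where the result token(s) must be sent

# Example: the result of an addition is forwarded to instructions 7 and 9,
# and a multiply packet carrying both operands is ready for a processing unit.
token = DataToken(value=3.5, destinations=[7, 9])
packet = OperationPacket(opcode="*", operands=[3.5, 2.0], destinations=[12])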
Data flow machine architectures Depending on the way of handling data tokens, data flow computers are divided into the static model and the dynamic model, as introduced in Figure 10.4 and Figure 10.5, respectively. In a static data flow
Figure 10.3 Three snapshots of the dataflow computation for a = (b + 1)*(b − c).
(Figures 10.4 and 10.5: the static and dynamic dataflow organizations; recoverable labels include memory unit (instructions), update unit, fetch unit, data tokens, matching unit, matched token sets, and update/fetch unit.)
machine, data tokens are assumed to move along the arcs of the data flow program graph to the operator nodes. The nodal operation is executed when all its operand data are present at the input arcs. Only one token is allowed to exist on any arc at any given time; otherwise, the successive sets of tokens could not be distinguished. This architecture is considered static because tokens are not labeled, and control tokens must be used to acknowledge the proper timing in transferring data tokens from node to node. Jack Dennis and his research group at the MIT Laboratory for Computer Science is currently developing a static data flow computer.
A dynamic data flow machine uses tagged tokens, so that more than one token can exist on an arc. The tagging is achieved by attaching to each token a label which uniquely identifies the context of that particular token. This dynamically tagged data flow model suggests that maximum parallelism can be exploited from a program graph. If the graph is cyclic, the tagging allows dynamic unfolding of the iterative computations. Dynamic data flow computers include the Manchester machine developed by Watson and Gurd at the University of Manchester, England, and the Arvind machine under development at MIT, which evolved from an earlier data flow project at the University of California at Irvine.
The two packet-communication organizations are based on two different schemes for synchronizing instruction execution. A point of commonality in the two organizations is that multiple processing elements can independently and asynchronously evaluate the executable instruction packets. In Figure 10.4, the data tokens are in the input pool of the update unit. This update unit passes data tokens to their destination instructions in the memory unit. When an instruction receives all its required operand tokens, it is enabled and forwarded to the enabled queue. The fetch unit fetches these instructions when they become enabled.
In Figure 10.5, the system synchronization is based on a matching mechanism. Data tokens form the input pool of the matching unit. This matching unit arranges data tokens into pairs or sets and temporarily stores each token until all of its partner operands have arrived, whereupon the matched token sets are released to the fetch-update unit. Each set of matched data tokens (usually two for binary operations) is needed for one instruction execution. The fetch-update unit forms the enabled instructions by merging the token sets with copies of their consumer instructions. The matching of the special tags attached to the data tokens can unfold iterative loops for parallel computations. We shall further discuss this tagged-token concept in later sections.
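As a rough illustration of the matching mechanism, the sketch below keeps a waiting pool keyed by tag and releases a matched token set once both operands of a two-input instruction have arrived. The tag structure and the two-operand assumption are simplifications made for this sketch only.

# A minimal sketch of tagged-token matching for two-input instructions.
# Tags and token values are illustrative, not a real machine format.
waiting = {}   # tag -> first token that arrived with this tag

def match(tag, token):
    """Return a matched token pair when both operands have arrived, else None."""
    if tag in waiting:
        partner = waiting.pop(tag)
        return (partner, token)        # matched set: release to the fetch-update unit
    waiting[tag] = token               # hold the token until its partner arrives
    return None

# Two loop iterations can be in flight at once because their tags differ.
print(match(("loop", 1), 10))   # None: first operand of iteration 1 is held
print(match(("loop", 2), 30))   # None: first operand of iteration 2 is held
print(match(("loop", 1), 20))   # (10, 20): iteration 1 is enabled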
Both static and dynamic data flow architectures have a pipelined ring structure. If we include the I/O, a generalized architecture is shown in Figure 10.6. The ring contains four resource sections: the memories, the processors, the routing network, and the input-output unit. The memories are used to hold the instruction packets. The processing units form the task force for parallel execution of enabled instructions. The routing network is used to pass the result tokens to their destination instructions. The input-output unit serves as an interface between the data flow computer and the outside world. For dynamic machines, the token matching is performed by the I/O section.
Figure 10.6 A ring-structured dataflow computer organization including the I/O functions. (Recoverable labels: input, output, input-output section with token matching, processing section with processors, data tokens.)
Most existing data flow machine prototypes are built as an attached processor
to a host computer, which handles the code translation and I/O functions. Even-
tually, computer architects wish to build stand-alone data flow computers. The
basic ring structure can be extended to many improved architectural configura-
tions for data flow systems. For example, one can build a data flow system with
multiple rings of resources. The routing network can be divided into several
functionally specialized packet-switched networks. The memory section can be
subdivided into cell blocks. We shall describe variants of data flow computer
architecture in Section 10.2.
Major design issues Toward the practical realization of a data flow computer, we identify below a number of important technical problems that remain to be solved:
1. The development of efficient data flow languages which are easy to use and easy to interpret by machine hardware
2. The decomposition of programs and the assignment of program modules to data flow processors
3. Controlling and supporting large amounts of interprocessor communication with cost-effective packet-switched networks
4. Developing intelligent data-driven mechanisms for either static or dynamic data flow machines
5. Efficient handling of complex data structures, such as arrays, in a data flow environment
6. Developing a memory hierarchy and memory allocation schemes for supporting data flow computations
7. A large need for user acquaintance with functional data flow languages, software supports, data flow compiling, and new programming methodologies
8. Performance evaluation of data flow hardware in a large variety of application domains, especially in the scientific areas
Approaches to attack the above issues and partial solutions to some of them
will be presented in subsequent sections. We need first to understand the basic
properties of data flow languages. After all, the data flow computers are language-
oriented machines. In fact, research on data flow machines started with data flow
languages. It is the rapid progress in VLSI that has pushed the construction of
several hardware data flow prototypes in recent years.
Example 10.1
1. P = X + Y   must wait for inputs X and Y
2. Q = P + Y   must wait for instruction 1 to complete
3. R = X × P   must wait for instruction 1 to complete
4. S = R − Q   must wait for instructions 2 and 3 to complete
5. T = R × P   must wait for instruction 3 to complete
6. U = S + T   must wait for instructions 4 and 5 to complete
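The data-driven firing of these six instructions can be sketched as below; the program walks the dependence lists and reports which instructions become enabled at each step once their operands are available, showing in particular that instructions 2 and 3 can fire concurrently after instruction 1. This is only an illustrative trace, not a machine description.

# Data-driven firing of the six instructions in Example 10.1 (illustrative sketch).
# Each instruction lists the operands it needs; it fires as soon as all are present.
program = {
    "P": ("X", "Y"), "Q": ("P", "Y"), "R": ("X", "P"),
    "S": ("R", "Q"), "T": ("R", "P"), "U": ("S", "T"),
}
available = {"X", "Y"}            # input tokens present at the start
pending = dict(program)

step = 1
while pending:
    # every instruction whose operands are all available fires in this step
    enabled = [name for name, ops in pending.items() if all(o in available for o in ops)]
    print(f"step {step}: fire {sorted(enabled)} concurrently")
    for name in enabled:
        del pending[name]
        available.add(name)       # the result token becomes available to successors
    step += 1
# Expected: step 1 fires P; step 2 fires Q and R; step 3 fires S and T; step 4 fires U.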
Figure 10.8 Operators (nodes) and links (arcs) for the construction of dataflow graphs.
Numerical data links transmit integer, real, or complex numbers, and boolean links carry only boolean values for control purposes. Figure 10.8 presents the various operator types used in constructing data flow graphs. An identity operator is a special operator that has one input arc and transmits its input value unchanged. Deciders, gates, and merge operators are used to represent conditional or iterative computation in data flow graphs. A decider requires a value from each input arc and produces the truth value resulting from applying the predicate P to the values received.
Control tokens bearing boolean values control the flow of data tokens by means of the T gates, the F gates, and the merge operators. A T gate passes a data token from its data input arc to its output arc when it receives the true value on its control input arc. It absorbs a data token from its data input arc and places nothing on its output arc if it receives a false value. An F gate has similar behavior, except that the sense of the control value is reversed. A merge operator has T and F data input arcs and a truth-value control arc. When a true value is received, the merge actor places the token from the true input arc on its output arc. The token on the other, unused input arc is discarded. Similarly, the false input is passed to the output when the control arc carries a false value.
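The behavior of these three control operators can be summarized by the small functions below; this is a sketch of the firing semantics just described (with None standing for "no token produced"), not code from any data flow system.

# Firing semantics of the control operators (illustrative sketch).
def t_gate(data_token, control):
    """Pass the data token only when the control token is true; otherwise absorb it."""
    return data_token if control else None

def f_gate(data_token, control):
    """Pass the data token only when the control token is false."""
    return data_token if not control else None

def merge(true_token, false_token, control):
    """Forward the token from the input arc selected by the control value."""
    return true_token if control else false_token

print(t_gate(42, True))        # 42
print(f_gate(42, True))        # None: the token is absorbed
print(merge("a", "b", False))  # "b"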
Example 10.2 The following program computes z = x^n for an input value x and a nonnegative integer n:

input x, n
y := 1; i := n
while i > 0 do
begin y := y * x; i := i − 1 end        (10.1)
z := y
output z

The successive values assumed by the loop variables y and i pass through the correspondingly labeled links in the program graph. The decider emits a token carrying the true value each time execution of the loop body is required. When the firing of the decider yields a false value, the value of y is routed to the output link z. Note the presence of tokens carrying false values on the input arcs of the merge operators. These tokens allow the merge operators to initiate execution of the loop by passing in the initial values of the loop variables. The initial values of the control tokens are marked as false in Figure 10.9.
Figure 10.9 The dataflow graph representation of the computation z = x^n specified in Example 10.2.
Data flow graphs form the basis of data flow languages. We briefly describe below some attractive properties of data flow languages.
Freedom from side effects This property is necessary to ensure that data dependencies are consistent with the sequencing constraints. Side effects come in many forms, such as procedures that modify variables in the calling program. The absence of global or common variables and careful control of the scopes of variables make it possible to avoid side effects. Another problem comes from the aliasing of parameters. Data flow languages provide "call by value" instead of "call by reference." This essentially solves the aliasing problem. Instead of having a procedure modify its arguments, a "call by value" procedure copies its arguments. Thus, it can never modify the arguments passed from the calling program. In other words, inputs and outputs are totally isolated to avoid unnecessary side effects.
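The contrast can be made concrete with the small sketch below: the first function mutates the caller's list (a side effect that call-by-reference aliasing permits), while the second copies its argument in the spirit of call by value, so the caller's data is never disturbed. Both function names are illustrative.

# Aliasing through call by reference: the caller's data is silently modified.
def scale_in_place(values, factor):
    for i in range(len(values)):
        values[i] *= factor          # side effect on the caller's list
    return values

# Call-by-value style: operate on a copy, so the caller's data is untouched.
def scale_copy(values, factor):
    return [v * factor for v in values]   # a fresh list; no side effects

data = [1, 2, 3]
scaled = scale_copy(data, 10)
print(data, scaled)      # [1, 2, 3] [10, 20, 30]  -- original preserved
scale_in_place(data, 10)
print(data)              # [10, 20, 30]            -- original destroyed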
not only regularly structured but also arbitrary parallelism in programs. The direct use of values instead of names of value containers (addresses) enables purely functional programming without side effects. Asynchronous parallelism can be exploited at the instruction level or at the procedure level. Inherently sequential computations can be unfolded to enable parallelism. The data flow approach applies the pipelining, array processing, and multiprocessing techniques discussed in previous chapters for control flow computers.
The data flow language does not introduce instruction-sequencing constraints other than the ones imposed by the data dependencies in the algorithm. In theory, maximum parallelism can be achieved if sufficient resources are provided. This approach extends naturally to an arbitrary number of processors in the system. The speedup should be linearly proportional to the increase in the number of processors. The high concurrency in a data flow computer is supported by easier program verification, better modularity and extendability of hardware, reduced protection problems, and superior confinement of software errors.
Matching with VLSI technology Recall the basic architecture of a data flow computer (Figure 10.6). The memory section contains instruction cells which can be uniformly structured in large-scale memory arrays. The pool of processing units and the network of packet switches can each also be regularly structured with modular cells. All this homogeneity and modularity in cellular structures contributes to the suitability of VLSI implementation of the major components in a data flow computer. As introduced in Chapter 1, the impressive progress in microelectronics technology has made it possible to challenge the fabrication of large arrays of processors, memories, and switches on VLSI chips.
The interconnection between chips can be built into highly dense packaging systems. It is fair to say that data flow machine architecture matches nicely with the technological supports that we anticipate having. The potential of VLSI and VHSIC technologies can be fully exploited in the development of data flow machines. The operations in a data flow computer may be asynchronous. However, the hardware components can be designed with synchronous functional pipes and clocked memory and switch arrays. With more lessons to be learned and the data flow hardware properly evaluated, it would be appropriate to consider VLSI implementation of some large-scale data flow systems.
imperative languages like Fortran and Pascal. This is especially true when the
computing environment demands a high degree of parallel processing to achieve
a prespecified level of performance. Intuitively, this assertion is valid for certain
algorithm constructs and work-load distributions. However, more empirical results are needed to prove its validity for general scientific computations.
Shortcomings of data flow computing Critics of the data flow approach have pointed out quite a number of potential problems in the development and application of data flow computers at the instruction level. It is instructive to learn from these reserved positions and to explore other alternatives for achieving high performance. In the conventional computer with centralized control hardware, an imperative language such as Fortran is used, and an intelligent compiler is needed to normalize the program and to generate the dependence graph, which guides the vectorization and optimization processes we have studied in previous chapters. High-level use of the dependence graph is practiced here primarily at compile time. The major criticisms of instruction-level data flow, as seen from this high-level perspective, are summarized below:
1. Data-driven execution at the instruction level causes excessive pipeline overhead per instruction, which may destroy the benefits of parallelism. The long pipeline-filling problem is attributed to queueing all enabled instructions at the input ports of every subsystem in the data flow ring. The queue lengths absorb some of the parallelism in a program; thus, performance suffers under improper buffering and traffic congestion.
2. Data flow programs tend to waste memory space through increased code length due to the single-assignment rule and the excessive copying of data arrays. The damaging effects of the memory access conflict problem are so far not well addressed by data flow researchers.
3. When a data flow computer becomes large, with high numbers of instruction cells and processing elements, the packet-switched network used becomes cost-prohibitive and a bottleneck to the entire system.
4. Some critics feel that data flow has a good deal of potential in small-scale or very large-scale parallel computer systems, with a raised level of control. For medium-scale parallel systems, data flow competes less favorably with the existing pipeline, array, and multiprocessor computers. We shall further discuss this assessment in Section 10.2.3.
Several interesting data flow computer architectures are studied in this section. The intent is to identify related architectural concepts rather than to describe the implementation details. We shall start with static data flow computers, represented by the Dennis machine at MIT. Then we describe several ring-structured, dynamic data flow computers, including the original Irvine machine and its successor machine at MIT, the EDDY system in Japan, and the Manchester machine in England. Several specially designed data flow systems will be briefly examined, including the Utah machine, the French LAU system, and the Newcastle data-control flow computer. Design alternatives of data flow computers will also be discussed to inspire future development.
Jack Dennis and his associates at MIT have pioneered the area of data flow research. They have developed a static data flow computing model and the associated language supports. There are two interesting data flow projects at MIT. For identification purposes, we distinguish them by calling them the Dennis machine and the Arvind machine. The Dennis machine has a static architecture, whereas the Arvind machine is dynamic, using tagged tokens and colored activities.
Data flow graphs used in the Dennis machine must follow a static execution rule that only one data token can occupy an arc at an instant. This leads to a static firing rule: an instruction is enabled if a data token is present on each of its input arcs and no token is present on any of its output arcs. Thus, the program graph contains control tokens as well as data tokens, both contributing to the enabling of an instruction. These control tokens act as acknowledge signals when data tokens are removed from output arcs.
The Dennis machine is designed to exploit the concurrency in programs represented by static data flow graphs. The structure of this data flow computer is shown in Figure 10.10; it consists of five major sections connected by channels through which information is sent in the form of discrete tokens (packets):
• Memory section consists of instruction cells which hold instructions and their operands.
• Processing section consists of processing units that perform functional operations on data tokens.
• Arbitration network delivers operation packets from the memory section to the processing section.
• Distribution network delivers data tokens from the processing section back to the memory section.
• Control network delivers control tokens (boolean values and acknowledge signals) from the processing section to the memory section.
Figure 10.10 The static dataflow computer architecture proposed at MIT. (Courtesy of Dennis et al., 1979.) (Recoverable labels: processing section with processing units, control network, distribution network, arbitration network, instruction cell blocks, memory section, control tokens, data tokens, operation packets.)
Instructions held in the memory section are enabled for execution by the
arrival of their operands in data tokens from the distribution network and control
tokens from the control network. Enabled instructions, together with their
operands, are sent as operation packets to the processing section through the
arbitration network. The results of instruction execution are sent through the
distribution network and the control network to the memory section, where they
become operands of other instructions. Each instruction cell has a unique address,
the cell identifier. An occupied cell holds an instruction consisting of an operation
code and several destinations. Each destination contains a destination address,
which is a cell identifier, and additional control information used by processing
units to generate result tokens. An instruction represents one or more operators
of the program graph, together with its output links. Instructions are linked
together through destination addresses stored in their destination fields.
Each instruction cell contains receivers which await the arrival of token values
for use as operands by the instruction. Once an instruction cell has received the
necessary operand tokens and acknowledge signals, the cell becomes enabled and
sends an operation packet consisting of the instruction and the operand values
to the appropriate processing unit through the arbitration network. Note that
the acknowledge signals are used to correctly implement the firing rule for pro-
gram graphs.
The arbitration network provides a path from each instruction cell to each
processing unit and sorts the operation packets among its output ports according
to the operation codes of the instructions they contain. For each operation packet
received, a processing unit performs the operation specified by the instruction
using the operand values in the packet and produces one or more result tokens,
which are sent to instruction cells through the control network and distribution
network. Each result token consists of a result value and a destination address
derived from the instruction being processed by the processing unit. There are
control tokens containing boolean values or acknowledge signals, which are sent
through the control network, and data packets containing integer or complex
values, which are sent through the distribution network.
The two networks deliver result tokens to receivers of instruction cells as
specified by their destination address fields; that is, data packets are routed
according to their destination address. The arrival of a result token at an instruc-
tion cell either provides one of the receivers of the cell with an operand value or
delivers an acknowledge signal; if all result tokens required by the instruction in
the cell have been received, the instruction cell becomes enabled and dispatches
its contents to the arbitration network as a new operation packet.
The functions performed by the processing unit are distributed among several
sections of the data flow processor. The operations specified by instructions are
carried out in the processing section, but control of instruction sequencing is a
function of the control network, and the decoding of operation codes is partially
done within the arbitration network. The address fields (destination addresses) of
instructions specify where the results should be sent instead of addressing a shared
memory cell. Instead of instructions fetching their operands, the operand values
are sent to the instructions.
All communication between subsystems in the Dennis machine is by packet
transmission over the channels. The transmission of packets over each channel
uses an asynchronous protocol so that the five sections of the computer can
operate independently without using central timing signals. Systems organized
to operate in this manner are said to have the packet communication architecture.
The instruction cells are assumed to be physically independent, so at any time
many of them may be enabled. The arbitration network should be designed to
allow many instruction packets to flow through it concurrently. Similarly, the
control network and the distribution network should be designed to distribute
dense streams of control and data packets back to the instruction cells. In this
way, both the appetites of pipelining and parallelism are satisfied. The arbitration,
distribution, and control networks of the data flow processor are examples of
packet-switched routing networks that perform the function of directing packets
to many functional units of the processor. If the parallelism represented in the
data flow graph is to be fully exploited, routing networks must have a high band-
width.
When the number of instruction cells becomes large, the three networks
shown in Figure 10.10 may become exceedingly large and thus cost prohibitive.
One approach that has been suggested to overcome this difficulty is to use the
concept of cell blocks. A cell block is a collection of instruction cells which share
the same set of input and output ports from the distribution, control, and arbi-
tration networks. The cell-block implementation and its use in the machine
architecture are demonstrated in Figure 10.11. By using shared I/O ports, the
arbitration networks can be partitioned into subnetworks of significantly smaller
sizes; so can the other networks in the system.
In Figure 10.12, we show an example design of the cell block-structured data flow multiprocessor system. The system consists of four processors and 32 cell blocks. Three building blocks can be used to construct the 32 x 4 arbitration network and the 4 x 32 distribution network. Illustrated in Figure 10.13 are the 2 x 1 arbiter, the 1 x 2 distributor, the 2 x 2 switch, and the 3 x 2 switch used in the network constructions in Figure 10.12 and Figure 10.14, respectively. The distributors are blocking free. The arbiter can pass only one input to its output at a time. The 2 x 2 switch has nine possible states, two of which may cause blocking. The 3 x 2 switch has 27 states, 14 of which will cause blocking. Whenever blocking takes place, only one of the conflicting requests can get through the switch.
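The state counts quoted above can be checked by brute-force enumeration: give each switch input three choices (idle, or a request for one of the two outputs) and call a state blocking when two or more inputs request the same output. The sketch below is only a verification aid, not a description of any actual switch implementation.

from itertools import product

def count_states(num_inputs, num_outputs=2):
    """Count total and blocking states of an unbuffered switch with the given ports."""
    choices = [None] + list(range(num_outputs))   # None = idle, k = request output k
    total = blocking = 0
    for state in product(choices, repeat=num_inputs):
        total += 1
        requests = [s for s in state if s is not None]
        if len(requests) != len(set(requests)):   # some output requested twice
            blocking += 1
    return total, blocking

print(count_states(2))   # (9, 2):   the 2 x 2 switch
print(count_states(3))   # (27, 14): the 3 x 2 switch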
The blocked requests will be discarded in an unbuffered network, and the requests must be resubmitted. For a packet-switched network with buffers at the inputs of the switches or arbiters, the blocked requests will be held waiting to be passed to the output ports at a later time. The buffered Delta networks discussed in Chapter 7 can be modified for use in a data flow environment. An example buffered arbitration network of size 27 x 8 is shown in Figure 10.14. Each input
Figure 10.11 The concept of grouping instruction cells into cell blocks. (Recoverable labels: memory section, cell blocks, distribution network, arbitration network, to/from PEs.)
Figure 10.12 A 4-processor dataflow computer with 32 cell blocks interconnected by a 32 x 4 arbitration network and a 4 x 32 distribution network. (Recoverable labels: processors 1-4, result tokens, operation packets, memory store.)
port of the 3 x 2 switch has a buffer. A round-robin scheme can be used to resolve conflicts among multiple requests destined for the same output port.
Since the arbitration network has many inputs, a serial format is appropriate for packet transfer between instruction cells (or cell blocks) and the arbitration network to reduce the number of connections needed. However, to achieve a high rate of packet flow at the output ports, a parallel format is required. For this
Figure 10.14 A 27-by-8 buffered Delta network for resource arbitration in a dataflow computer.
(Figure 10.15: a physical domain of PEs, the pipelining of tokens within a physical domain, counter-rotating token ring buses, token buses, local memories M, and a global bus.)
shift-register rings. Each ring is partitioned into as many slots as there are PEs
and each slot is either empty or holds one data token. Obviously, the token rings
are used to transfer tagged tokens among the PEs.
Each cluster of PEs (four PEs per cluster, as shown in Figure 10.15) shares a
local memory through a local bus and a memory controller. A global bus is used
to transfer data structures among the local memories. Each PE must accept all
tokens that are sent to it and sort those tokens into groups by activity name. When
all input tokens for an activity have arrived (through tag matching), the PE must
execute that activity. The U-interpreter can help implement iterative or procedure
(Figure 10.16: processing elements connected by an N x N network; recoverable labels include PE, the N x N network, and constants.)
Figure 10.17 The processing element (PE) in Arvind's dataflow machine at MIT. (Recoverable labels: input, waiting-matching section, instruction-fetch section, program and data memory, service section (ALU, I-structure memory, PE mapping), output.)
either one or two tokens, as indicated by a field of the token. If the activity corresponding to the input token requires another token, the waiting-matching section is informed. The latter has a buffer to hold those tokens for which another token with a matching tag has not yet arrived. Whenever the tags of two tokens match, both tokens are moved to the instruction-fetch buffer. Based on the statement-number part of the tag, an instruction from the local program memory is fetched.
If a match for the tag of the input token is not found and the waiting-matching buffer is full, a refusal to accept the token would cause a deadlock. Therefore, if the buffer-full condition exists, the token has to be stored somewhere else and retrieved at a later time. After the instruction has been fetched, an operation packet containing the operation code, the operands, and the destinations is formed and sent to the service section. The service section contains a floating-point ALU and the hardware to calculate new activity names and destination PE numbers.
The I-structure is a special tagged memory for storing array-like data structures with constraints on their creation and access. Essentially, an element of an I-structure can be defined only once. A presence bit is associated with every element of an I-structure. An attempt to read an element whose presence bit is not set causes the read to be deferred. The use of I-structures can avoid excessive array copying. The service section also processes the memory operations except I-structure reads. After the ALU or the memory produces the result and the new tags and destination PE numbers have been computed, the result tokens are sent to the output section. Since it is possible to encounter delays in transmitting a token through the communication network, some buffer space is provided in the output section.
A separate section to hold deferred reads (i.e., requests to read an element of an I-structure before it has been produced) is needed to avoid blocking the service section. Every unsuccessful read request is marked and set aside. An unusual feature of the PE is that it has no program counter. Instead, it maintains a list of enabled activities in the service section and can execute them in any order. A PE will have 100 percent ALU utilization as long as there is at least one enabled activity in the service queue at any given time instant.
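The write-once discipline and the deferred-read behavior of an I-structure can be sketched as below. The class and method names are illustrative assumptions; a real I-structure controller would defer reads in hardware rather than by queueing callbacks.

# A minimal sketch of an I-structure: write-once elements with presence bits,
# where reads of not-yet-written elements are deferred rather than failing.
class IStructure:
    def __init__(self, size):
        self.values = [None] * size
        self.present = [False] * size
        self.deferred = {i: [] for i in range(size)}   # deferred read requests

    def read(self, i, consumer):
        if self.present[i]:
            consumer(self.values[i])           # element already produced
        else:
            self.deferred[i].append(consumer)  # defer until the element is written

    def write(self, i, value):
        if self.present[i]:
            raise RuntimeError("an I-structure element may be defined only once")
        self.values[i], self.present[i] = value, True
        for consumer in self.deferred.pop(i, []):
            consumer(value)                    # satisfy the deferred reads

a = IStructure(4)
a.read(2, lambda v: print("deferred read sees", v))
a.write(2, 99)     # prints: deferred read sees 99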
A group of PEs known as a physical domain is allocated whenever a procedure or a loop is invoked. All activities of the invoked procedure (or loop) take place within the physical domain except those activities which are caused by an operator that changes the context part. The activities of a procedure (or loop) can be distributed within a physical domain on the basis of the instruction number or the iteration number of an activity name. Tag matching may contribute to additional overhead, which is potentially a performance bottleneck for dynamic data flow computers.
In order to include the possibility of activating several code blocks (not necessarily distinct from each other) within a physical domain, one can assign a different color to each activation. However, only a finite number of colors are allowed within a physical domain and, if all colors of a physical domain are in use, no new loop or procedure activation can be scheduled on it. Colors are released when
Figure 10.18 The EDDY dataflow machine in Japan. (Courtesy of Conf. Proc. 10th Annual Symp. Computer Architecture, Takahashi and Amamiya, June 1983.) (Recoverable labels: host computer, memory interface, broadcast control, peripherals, and a 4 x 4 array of PEs.)
(Figure 10.19: the structure of a processing element; recoverable labels include instruction memory (IM) section, operand memory (OM) section, operation unit section with functional units, link memory, link nodes, and inter-PE communication control.)
unique tag is assigned to each of them. In dynamic tagging, tags are assigned to each environment dynamically. An execution environment is represented by a (tag) name.
The environment name and opcode are used as a key for the associative search. The operation section consists of several functional units, including some number crunchers. The communication section consists of the link memory and an inter-PE communication controller. This controller sends result packets to the link memory or to other PEs, and also receives result packets from other PEs and transmits them to its own link memory.
The data flow project at Manchester University has included the design of the high-level, single-assignment programming language Lapse; the implementation of translators for Lapse and a subset of Pascal; and the production of a detailed simulator for the Manchester computer architecture. Currently the group is completing a 20-processor data flow computer prototype.
The Manchester machine also assumes a ring structure, as demonstrated in Figure 10.20. Five functional blocks communicate in a clockwise direction around the ring. A token package is the main unit of information and comprises a data value, a label, and a destination node pointer. The matching unit groups tokens. When sufficient tokens arrive to fire a node, an appropriate group package finds a destination node description in the node store. An executable package [containing operator, operands, label, and pointer(s) to further destination node(s)] is sent to the processing unit for execution. The switch handles external input/output. The token queue saves excess tokens generated at about the same time.
Parallel data-driven rings have been suggested to extend the Manchester machine (Figure 10.21). Very high processing rates may be achieved by connecting multiple pipelined rings in this approach. A unidirectional pipelined exchange switch is modularly extensible and of relatively simple form as compared
Figure 10.20 The Manchester dataflow computer organization. (Courtesy of AFIPS Proc. of NCC, Watson and Gurd, June 1979.) (Recoverable labels: input, output, switch, token queue, matching unit, node store, processing unit (PEs), labeled tokens, enabled instructions.)
Figure 10.21 Multiple-ring architecture proposed for the Manchester machine (TQ: token queue; MU: matching unit; NS: node store; PU: processing unit). (Courtesy of Computer Design, Gurd and Watson, July 1980.) (Recoverable labels: input, output, labelled tokens, exchange switch network.)
to a crossbar switch, for example. In such a very large system, the major problem
is to distribute the work load evenly among multiple data flow rings demonstrated
in the figure.
Figure 10.22 Comparison between data-driven and dependence-driven computing models. (Courtesy of IEEE Computer, Gajski et al., February 1982.) (Recoverable labels: dependence-driven ordinary language program, program normalization, data-driven dataflow language program, dependence graph generation, dependence graph, code generation.)
A second opinion has been expressed on data flow machines and languages by the dependence-driven researchers. Two questions were raised: First, are data flow languages marketable? To date, the high-speed computer market has been dominated by conservatism and software compatibility. Can data flow languages, as currently proposed, overcome this conservatism? Second, will data flow languages enhance programmer productivity? Although data flow researchers have made some claims to this effect, they remain unsubstantiated.
Dependence-driven researchers felt that, in small-scale parallel systems, data flow principles have been successfully demonstrated. When simultaneity is low, irregular, and run-time dependent, data flow might be the architecture of choice. In very large-scale parallel systems, data flow principles still show some potential
Figure 10.23 The hardware structure suggested for high-level data flow computing, for either the dependence-driven model or the event-driven model. (Recoverable labels: global controller, processor clusters, shared interconnection network, shared memory space, control.)
for high-level control. It is in medium-scale parallel systems that data flow has little chance of success. Pipelined, parallel, and multiprocessor systems are all effective in this range. For data flow processing to become established here, its inherent inefficiencies must be overcome.
Figure 10.24 Multilevel program abstraction in the event-driven data flow computing model. (Recoverable labels: job level, compound-function level, procedure level, task level, subtask level, instruction level; engrossments and abstractions.)
1. The design of a machine instruction set and high-level data flow languages
2. The design of the packet communication networks for resource arbitration and for token distribution
3. The development of data flow processing elements and the structured memory
4. The development of activity control mechanisms and data flow operating system functions
Simplicity and regularity Cost effectiveness has always been a major concern in
designing special-purpose VLSI systems; their cost must be low enough to justify
their limited applicability. Special-purpose design costs can be reduced by the
use of appropriate architectures. If a structure can truly be decomposed into a
few types of building blocks which are used repetitively with simple interfaces,
great savings can be achieved. This is especially true for VLSI designs where a
single chip comprises hundreds of thousands of identical components. To cope
with that complexity, simple and regular designs are essential. VLSI systems based
on simple, regular layout are likely to be modular and adjustable to various
performance levels.
basic principle of systolic architectures and explains why they should result in
cost-effective, high-performance, special-purpose systems for a wide range of
potential applications.
A systolic system consists of a set of interconnected cells, each capable of
performing some simple operation. Because simple, regular communication and
control structures have substantial advantages over complicated ones in design
and implementation, cells in a systolic system are typically interconnected to
form a systolic array or a systolic tree. Information in a systolic system flows
between cells in a pipelined fashion, and communication with the outside world
occurs only at the “boundary” cells. For example, in a systolic array, only those
cells on the array boundaries may be I/O ports for the system.
The basic principle of a systolic array is illustrated in Figure 10.25. By replacing a single processing element with an array of PEs, a higher computation throughput can be achieved without increasing memory bandwidth. The function of the memory in the diagram is analogous to that of the heart: it "pulses" data through the array of PEs. The crux of this approach is to ensure that once a data item is brought out from the memory it can be used effectively at each cell it passes. This is possible for a wide class of compute-bound computations where multiple operations are performed on each data item in a repetitive manner.
Suppose each PE in Figure 10.25 operates with a clock period of 100 ns. The conventional memory-processor organization in Figure 10.25a has at most a performance of 5 million operations per second. With the same clock rate, the systolic array will result in 30 MOPS performance. This gain in processing speed can also be justified by the fact that the number of pipeline stages has been increased six times in Figure 10.25b. Being able to use each input data item a number of times is just one of the many advantages of the systolic approach. Other advantages include modular expansibility, simple and regular data and control flows, use of simple and uniform cells, elimination of global broadcasting, limited fan-in, and fast response time.
(Figure 10.25: (a) the conventional memory-processor organization and (b) a systolic array of PEs pulsed by the memory.)
The basic processing cells used in the construction of systolic arithmetic arrays are the additive-multiply cells specified in Figure 3.29. This cell has the three inputs a, b, c and three outputs: a and b are passed through unchanged, and d = c + a × b. One can assume that six interface registers are attached at the I/O ports of a processing cell. All registers are clocked for synchronous transfer of data among adjacent cells. The additive-multiply operation is needed in performing the inner product of two vectors, matrix-matrix multiplication, matrix inversion, and L-U decomposition of a dense matrix.
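A single cell can be modeled as the small function below, which simply realizes the inner-product step d = c + a·b while passing a and b through unchanged; this is an illustrative model, not the register-level design of Figure 3.29.

def additive_multiply_cell(a, b, c):
    """One clocked step of the cell: pass a and b through and output d = c + a*b."""
    return a, b, c + a * b

# Using the cell repeatedly computes an inner product, the operation to which
# matrix multiplication and L-U decomposition reduce.
x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]
acc = 0.0
for a, b in zip(x, y):
    _, _, acc = additive_multiply_cell(a, b, acc)
print(acc)   # 32.0 = 1*4 + 2*5 + 3*6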
Illustrated below is the construction of a systolic array for the multiplication of two banded matrices. An example of band matrix multiplication is shown in Figure 10.26a. Matrix A has a bandwidth (3 + 2) − 1 = 4 and matrix B has a bandwidth (2 + 3) − 1 = 4 along their principal diagonals. The product matrix C = A·B then has a bandwidth (4 + 4) − 1 = 7 along its principal diagonal. Note that all three matrices have dimension n × n, as shown by the dotted entries. A matrix of bandwidth w may have w diagonals that are not all zeros. The entries outside the diagonal band are all zeros.
It requires w1 × w2 processing cells to form a systolic array for the multiplication of two sparse matrices of bandwidths w1 and w2, respectively. The resulting product matrix has a bandwidth of w1 + w2 − 1. For this example, w1 × w2 = 4 × 4 = 16 multiply cells are needed to construct the systolic array shown in Figure 10.26b. It should be noted that the size of the array is determined by the bandwidths w1 and w2, independent of the dimension n × n of the matrices. Data flows in this diamond-shaped systolic array are indicated by the arrows among the processing cells.
The elements of the A = (a_ij) and B = (b_ij) matrices enter the array along the two diagonal data streams. The initial values of the C = (c_ij) entries are zeros. The outputs at the top of the vertical data stream give the product matrix. Three data streams flow through the array in a pipelined fashion. Let the time delay of each processing cell be one unit time. This systolic array can finish the band matrix multiplication in T time units, where

T = 3n + min(w1, w2)     (10.3)

Therefore, the computation time is linearly proportional to the dimension n of the matrix. When the matrix bandwidths increase to w1 = w2 = n (for dense matrices A and B), the time becomes O(4n), neglecting the I/O time delays. If one used a single additive-multiply processor to perform the same matrix multiplication, O(n³) computation time would be needed. The systolic multiplier thus has a speed gain of O(n²). For large n, this improvement in speed is rather impressive.
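The timing claim of Eq. 10.3 can be compared with the work a single additive-multiply cell would need; the sketch below assumes roughly n·w1·w2 inner-product steps for the sequential banded case, which is an illustrative approximation rather than a figure from the text.

def systolic_time(n, w1, w2):
    """Band-matrix multiplication time on the systolic array, from Eq. 10.3."""
    return 3 * n + min(w1, w2)

def sequential_steps(n, w1, w2):
    """Rough count of additive-multiply steps on a single cell (illustrative estimate)."""
    return n * w1 * w2

for n in (100, 1000):
    w1 = w2 = 4                                                  # the banded example of Figure 10.26
    print(n, systolic_time(n, w1, w2), sequential_steps(n, w1, w2))
    print(n, systolic_time(n, n, n), sequential_steps(n, n, n))  # dense case: about 4n versus n**3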
VLSI systolic arrays can assume many different structures for different compute-bound algorithms. Figure 10.27 shows various systolic array configurations, and their potential usage in performing various computations is listed in Table 10.1. These computations form the basis of signal and image processing, matrix arithmetic, combinatorial, and database algorithms. Due to their simplicity and
(Figure 10.26: (a) band matrix multiplication; (b) the diamond-shaped systolic array of additive-multiply cells, each computing d = c + a × b, that performs it.)
(Figure 10.27: various systolic array configurations; recoverable panel titles include (a) one-dimensional linear array, (d) binary tree, and (e) triangular array.)
Table 10.1 Computations suited to various systolic structures

1-D linear arrays: FIR filter, convolution, discrete Fourier transform (DFT), solution of triangular linear systems, carry pipelining, Cartesian product, odd-even transposition sort, real-time priority queue, pipeline arithmetic units.
2-D square arrays: Dynamic programming for optimal parenthesization, graph algorithms involving adjacency matrices.
2-D hexagonal arrays: Matrix arithmetic (matrix multiplication, L-U decomposition by Gaussian elimination without pivoting, QR-factorization), transitive closure, pattern match, DFT, relational database operations.
Trees: Searching algorithms (queries on nearest neighbor, rank, etc., systolic search tree), parallel function evaluation, recurrence evaluation.
For large n (say n ≥ 1000) with a typical operand width w = 32 bits, it is rather impractical to fabricate an n × n systolic array on a monolithic chip with over 4n × w = 128,000 I/O terminals. Of course, I/O port sharing and time-division multiplexing can be used to alleviate the problem, but still, I/O is the bottleneck. Until the I/O problem can be satisfactorily solved, systolic arrays can be constructed only in small sizes. The modular VLSI approach to be described in Section 10.4 offers an alternative to overcome this difficulty.
(Figure 10.28: processors and memory modules M connected through an interconnection network.)
the network geometry, F is the cell function, and T is the network timing. These features are described separately below.
The network geometry G refers to the geometrical layout of the network. The position of each processing cell in the plane is described by its Cartesian coordinates. The interconnection between cells can then easily be described by the positions of the terminal cells. These interconnections support the flow of data through the network; a link can be dedicated to only one data stream of variables, or it can be used for the transport of several data streams at different time instances. A simple and regular geometry is desired to uphold local communications.
The functions F associated with each processing cell represent the totality of arithmetic and logic expressions that a cell is capable of performing. We assume that each cell consists of a small number of registers, an ALU, and control logic. Several different types of processing cells may coexist in the same network; however, one design goal should be the reduction of the number of cell types used.
The network timing T specifies for each cell the time when the processing of the functions F occurs and when the data communications take place. A correct timing assures that the right data reaches the right place at the right time. The speed of a data stream through the network is given by the ratio of the distance of the communication link to the communication time. Networks with constant data speeds are preferable because they require simpler control logic.
The basic structural features of an algorithm are dictated by the data and
control dependencies. These dependencies refer to precedence relations of
WWW.Gitmgurgaon.blogspot.com
for k ← 0 until n−1 do
  begin
    1: u_kk ← a_kk^(k)
    for j ← k+1 until n−1 do
      2: u_kj ← a_kj^(k)
    for i ← k+1 until n−1 do
      3: l_ik ← a_ik^(k) / u_kk
    for i ← k+1 until n−1 do
      for j ← k+1 until n−1 do
        4: begin
             a_ij^(k+1) ← a_ij^(k) − l_ik · u_kj
           end
  end                                            (10.7)
The data dependencies for this three-loop algorithm have the nice property that

d1 = (1, 0, 0)^T
d2 = (0, 1, 0)^T          (10.8)
d3 = (0, 0, 1)^T

We write the above in matrix form as D = [d1, d2, d3] = I. There are several other algorithms which lead to these simple data dependencies, and they were among the first to be considered for VLSI implementation.
The next step is to identify a linear transformation T that modifies the data dependencies to T·D = Δ, where Δ = [δ1, δ2, δ3] represents the modified data dependencies in the new index space, which is selected a priori. This transformation T must offer maximum concurrency by minimizing data dependencies, and T must be a bijection. A large number of choices exist, each leading to a different array geometry. We choose the following one:
        | 1  1  1 |
    T = | 0  1  0 |     such that   (k', i', j')^T = T · (k, i, j)^T          (10.9)
        | 0  0  1 |
The original indices k, i, j are transformed by T into k', i', j'. The organization of the VLSI array for n = 5 generated by this transformation T is shown in Figure 10.29.
In this architecture, the variables a_ij do not travel in space, but are updated in time. The variables l_ik move along the direction j' (east) with a speed of one grid per
Figure 10.29 A square systolic array for L-U decomposition (n = 5) in Example 10.3. (Courtesy of Proceedings of the IEEE, Moldovan, January 1983.)
time unit, and the variables u_kj move along the direction i' (south) with the same speed. The network is loaded initially with the coefficients of A, and at the end the cells below the diagonal contain L and the cells above the diagonal contain U.
The processing time of this square array is 3n — 5. All the cells have the same
architecture. However, their functions at one given moment may differ. It can be
seen from the program in statement (10.7) that some cells may execute loop four,
while others execute loops two or three. If we wish to assign the same loops only
to specific cells, then the mapping must be changed accordingly.
For example, a different transformation introduces a new data communication link between cells toward the north-west. These new links support the movement of the variables a_ij. According to this new transformation, the cells of the first row always compute loop two, the cells of the first column compute loop three, and the rest compute loop four. The reader can now easily identify other valid transformations which lead to different array organizations.
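To see what the transformation of Eq. 10.9 does, one can simply apply T to every index point of loop 4. In the sketch below (our own illustration), the first transformed coordinate k' = k + i + j is interpreted as the time step and (i', j') as the cell position, an interpretation consistent with the array of Figure 10.29 and with the processing time 3n − 5 quoted below.

    # Apply T of Eq. 10.9 to the index points (k, i, j) of loop 4;
    # kp acts as time, (ip, jp) as the cell coordinates.
    n = 5
    T = [[1, 1, 1], [0, 1, 0], [0, 0, 1]]

    def transform(p):
        return tuple(sum(T[r][c] * p[c] for c in range(3)) for r in range(3))

    schedule = {}
    for k in range(n):
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                kp, ip, jp = transform((k, i, j))
                schedule.setdefault(kp, []).append((ip, jp))

    # No cell is asked to do two loop-4 computations in the same time step.
    for t in sorted(schedule):
        assert len(set(schedule[t])) == len(schedule[t])
    steps = max(schedule) - min(schedule) + 1
    print("time steps used:", steps, "(3n - 5 =", 3 * n - 5, ")")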
The design of algorithmically specialized VLSI devices is at its beginning. The
development of specialized devices to replace mathematical software is feasible
but still very costly. Several important technical issues remain unresolved and deserve further investigation. Some of these are: I/O communication in VLSI
technology, partitioning of algorithms to maintain their numerical stability, and
minimization of the communication among computational blocks.
(Figure: algorithmically specialized VLSI structures: (a) mesh for dynamic programming; (b) hexagonally connected mesh for L-U decomposition; (c) torus for transitive closure; (d) binary tree for sorting.)
The switch lattice is characterized by four parameters:

    m -- the number of wires entering a switch on one data path (path width)
    d -- the degree of incident data paths to a switch
    c -- the number of configuration settings that can be stored in a switch
    g -- the number of distinct data-path groups that a switch can connect simultaneously
The value of m reflects the balance struck between parallel and serial data transmission. This balance will be influenced by several considerations, one of which is the limited number of pins on the package. Specifically, if a chip hosts a square region of the lattice containing n PEs, then the number of pins required is proportional to m·sqrt(n).
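A quick way to see this scaling is to note that only the roughly 4·sqrt(n) boundary cells of a square region need off-chip wires; the sketch below (unit constant of proportionality and sample values are our assumptions) prints the resulting pin counts.

    # Pin count of a square chip region holding n PEs with path width m wires:
    # only the ~4*sqrt(n) boundary positions need off-chip connections.
    import math

    def pins(n_pes, m):
        return 4 * math.isqrt(n_pes) * m

    for n_pes in (16, 64, 256):
        print(n_pes, pins(n_pes, m=8))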
The value of d will usually be four, as in Figure 10.31a, or eight, as in Figure 10.31c. Figure 10.31b shows a mixed strategy that exploits the tendency of switches
WWW.Gitmgurgaon.blogspot.com
Figure 10.31 Three switch lattice structures (circles represent switches and squares represent PEs). (Courtesy of IEEE Computer, Snyder, January 1982.)
Figure 10.32 Switch lattice configuration settings of the structure in Figure 10.31a; in (b) the switch lattice of Figure 10.31a is configured into a binary tree. (Courtesy of IEEE Computer, Snyder, January 1982.)
to be used in two different roles. Switches at the intersection of the vertical and
horizontal switch corridors tend to perform most of the routing, while those
interposed between two adjacent PEs act more like extended PE ports for selecting
data paths from the "corridor buses." The value of c is influenced by the number of configurations that may be needed for a multiphase computation and by the number of bits required per setting.
(Figure: (a) a bipartite graph.)
Figure 10.34 Planar embedding of a 255-node binary tree into the switch lattice of Figure 10.31a (the root of the tree is at the center of the lattice). (Courtesy of IEEE Computer, Snyder, January 1982.)
Figure 10.35 Layout of a Wafer Scale Integrated (WSI) processor array. (Courtesy of K. S. Hedlund, Ph.D. Thesis, Purdue University, June 1982.)
Figure 10.37 VLSI arithmetic modules for matrix computations: (a) a submatrix decomposition module; an M module computing D = C + sum over i = 1, ..., p of A_i·B_i, where C, D, and the A_i, B_i are m x m matrices; an F module computing d = c + sum over i = 1, ..., p of A_i·b_i, where c, d, and the b_i are m x 1 column vectors and the A_i are m x m matrices; (d) a matrix-vector multiplier. (Courtesy of IEEE Trans. Computers, Hwang and Cheng, December 1982.)
Figure 10.39 VLSI computing module for the inversion of a triangular matrix. (Courtesy of IEEE Proc. of 5th Symp. on Computer Arithmetic, Hwang and Cheng, May 1981.)
    L_p1 = A_p1 · U_11^(-1)     for p = 2, 3, ..., k
    U_1q = L_11^(-1) · A_1q     for q = 2, 3, ..., k                 (10.11)
Figure 10.40 VLSI arithmetic module for the multiplication of two sequences of 2 x 2 matrices (AMU: additive multiply unit; L: latch; MPX: multiplexer; DMX: demultiplexer). (Courtesy of IEEE Proc. of 5th Symposium on Computer Arithmetic, Hwang and Cheng, May 1981.)
(Figure: partitioning of an n x n matrix with n = k·m into k x k submatrices; all squares are m x m submatrices.)
The remaining off-diagonal submatrices L_pr and U_rq are computed by inverting the diagonal submatrices U_rr and L_rr and then multiplying them by the intermediate submatrices A_pr and A_rq at successively even-numbered steps. For r = 2, 3, ..., k, we compute

    L_pr = A_pr · U_rr^(-1)     for p = r + 1, ..., k
    U_rq = L_rr^(-1) · A_rq     for q = r + 1, ..., k                (10.14)
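The recurrences of Eqs. 10.11 and 10.14 are easy to check in software. The sketch below is our own illustration (numpy, a small unpivoted Doolittle routine for the diagonal blocks, and the 8 x 8 test matrix with 2 x 2 blocks are assumptions); it carries out the diagonal decompositions, the off-diagonal block computations, and the intermediate updates, then verifies that the block factors reproduce A.

    # Partitioned L-U decomposition following Eqs. 10.11 and 10.14 (sketch).
    import numpy as np

    def lu(a):                          # unpivoted LU of one small dense block
        n = a.shape[0]
        L, U = np.eye(n), a.copy()
        for col in range(n):
            for row in range(col + 1, n):
                L[row, col] = U[row, col] / U[col, col]
                U[row, col:] -= L[row, col] * U[col, col:]
        return L, U

    def block_lu(A, m):
        k = A.shape[0] // m
        A = A.copy()
        L = np.zeros_like(A)
        U = np.zeros_like(A)
        B = lambda M, p, q: M[p*m:(p+1)*m, q*m:(q+1)*m]
        for r in range(k):
            Lrr, Urr = lu(B(A, r, r))                   # odd-numbered step
            B(L, r, r)[:] = Lrr
            B(U, r, r)[:] = Urr
            for p in range(r + 1, k):                   # Eqs. 10.11 / 10.14
                B(L, p, r)[:] = B(A, p, r) @ np.linalg.inv(Urr)
                B(U, r, p)[:] = np.linalg.inv(Lrr) @ B(A, r, p)
            for p in range(r + 1, k):                   # update remaining blocks
                for q in range(r + 1, k):
                    B(A, p, q)[:] -= B(L, p, r) @ B(U, r, q)
        return L, U

    rng = np.random.default_rng(0)
    A = rng.random((8, 8)) + 8 * np.eye(8)
    L, U = block_lu(A, m=2)
    print(np.allclose(L @ U, A))        # True: the block factors reproduce A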
Example 10.4
The above iterative procedures are summarized in Figure 10.42 for the partitioned L-U decomposition of any nonsingular dense matrix A of order n.
(Figure 10.42: iterative procedure for partitioned L-U decomposition; at each step the remaining submatrices are updated by subtracting products of the already computed L and U submatrices.)
Example 10.5
Consider the inversion of an upper triangular matrix U partitioned into 4 x 4 submatrices, with V = U^(-1):

        | U_11  U_12  U_13  U_14 |              | V_11  V_12  V_13  V_14 |
    U = |  0    U_22  U_23  U_24 |          V = |  0    V_22  V_23  V_24 |
        |  0     0    U_33  U_34 |              |  0     0    V_33  V_34 |
        |  0     0     0    U_44 |              |  0     0     0    V_44 |
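The block inverse can be built outward from the diagonal. The recurrence used in the sketch below is our reconstruction, not a quotation of the example: V_ii = U_ii^(-1) and, for j > i, V_ij = -U_ii^(-1) · (U_{i,i+1}·V_{i+1,j} + ... + U_ij·V_jj); numpy and the test matrix are assumptions.

    # Block inversion of an upper-triangular matrix U with m x m blocks.
    import numpy as np

    def block_triangular_inverse(U, m):
        k = U.shape[0] // m
        V = np.zeros_like(U)
        B = lambda M, i, j: M[i*m:(i+1)*m, j*m:(j+1)*m]
        for i in range(k):
            B(V, i, i)[:] = np.linalg.inv(B(U, i, i))
        for d in range(1, k):            # move outward one block diagonal at a time
            for i in range(k - d):
                j = i + d
                s = sum(B(U, i, l) @ B(V, l, j) for l in range(i + 1, j + 1))
                B(V, i, j)[:] = -B(V, i, i) @ s
        return V

    rng = np.random.default_rng(1)
    U = np.triu(rng.random((8, 8))) + 4 * np.eye(8)
    V = block_triangular_inverse(U, m=2)
    print(np.allclose(U @ V, np.eye(8)))   # True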
Triangular linear system solver The VLSI solution of a triangular linear system of equations can be done in k back-substitution steps. In the following example, x_i and b_i for i = 1, 2, ..., k are m x 1 subvectors and the U_ij are m x m submatrices:
Example 10.6
In general, k = n/m steps are needed in the back substitution. Matrix U is partitioned into k(k + 1)/2 submatrices of order m x m each. The solution vector x is divided into k subvectors.
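A software model of this k-step back substitution is given below as a sketch only (numpy is used for the small dense blocks, and the test matrix and vector are our own); each iteration of the outer loop corresponds to one back-substitution step over a block row.

    # Back substitution on a block upper-triangular system U x = b, k = n/m steps.
    import numpy as np

    def block_back_substitute(U, b, m):
        k = U.shape[0] // m
        x = np.zeros_like(b)
        B = lambda i, j: U[i*m:(i+1)*m, j*m:(j+1)*m]
        for i in range(k - 1, -1, -1):           # one step per block row
            rhs = b[i*m:(i+1)*m].copy()
            for j in range(i + 1, k):
                rhs -= B(i, j) @ x[j*m:(j+1)*m]
            x[i*m:(i+1)*m] = np.linalg.solve(B(i, i), rhs)
        return x

    rng = np.random.default_rng(2)
    U = np.triu(rng.random((8, 8))) + 4 * np.eye(8)
    b = rng.random(8)
    x = block_back_substitute(U, b, m=2)
    print(np.allclose(U @ x, b))                 # True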
Figure 10.43 Arithmetic pipeline for partitioned L-U decomposition (Example 10.4). (Courtesy of Computer Vision, Graphics, and Image Processing, Hwang and Su, 1983a.)
Figure 10.44 A pipelined VLSI matrix inverter (Example 10.5); all U_ij and V_ij are m x m submatrices defined in the example. (Courtesy of Computer Vision, Graphics, and Image Processing, Hwang and Su, 1983a.)
Figure 10.46 A VLSI triangular system solver (Example 10.6). (Courtesy of Computer Vision, Graphics, and Image Processing, Hwang and Su, 1983a.)
where O(n^2) is the compute time of using a uniprocessor to solve a triangular system of algebraic equations.
It is obvious that trade-offs exist between the module counts and time delays of
all partitioned matrix algorithms. By presetting a speed requirement, one can
decide the minimum number of VLSI modules needed to achieve the desired speed
performance. On the other hand, one can predict the speed performance under
prespecified hardware allowance. This trade-off study is necessary for cost-effective
design of large-scale matrix system solvers.
(Table: serial-parallel architectures with minimum time delays versus strictly parallel architectures with a bounded number of VLSI modules; for each matrix algorithm the table lists the VLSI module count and module types and the total compute time. Note: all measures are based on the assumption n >> m >> 1, where n is the matrix order and m is the VLSI module size. * Predominating VLSI modules to be used.)
VLSI feature extraction Foley and Sammon (1975) have introduced a discriminating method to generate an optimal set of transformation vectors based on maximum separability instead of best picture fitting. The Foley-Sammon algorithm is modified below to allow modular implementation of a feature extractor by the proposed VLSI hardware.
Let n_s be the number of training samples and m_s be the sample mean for class s. The sample offset z_j^(s) is a column vector formed by the following vector subtraction:

    z_j^(s) = x_j^(s) - m_s        for j = 1, 2, ..., n_s
Figure 10.47 A statistical pattern recognition system: feature extraction maps a pattern from the input space to the feature space, and pattern classification then assigns a class label.
The sample offset matrix Z_s is an n x n_s matrix formed by Z_s = [z_1^(s), z_2^(s), ..., z_{n_s}^(s)]. A within-class scatter matrix S_s is obtained by performing an orthogonal matrix multiplication, where S_s is an n x n matrix:

    S_s = Z_s · Z_s^T                                                (10.18)

A weighted scatter matrix A is defined below for classes s and t. The fraction C, where 0 < C < 1, is determined by a generalized Fisher criterion:

    A = C · S_s + (1 - C) · S_t                                      (10.19)
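The quantities in Eqs. 10.18 and 10.19 are easy to state in software. The sketch below is ours (synthetic data, numpy, and the chosen value of C are assumptions); it forms the sample offsets, the within-class scatter matrices, and the weighted scatter matrix A.

    # Within-class scatter (Eq. 10.18) and weighted scatter matrix (Eq. 10.19).
    import numpy as np

    def scatter_matrix(X):                    # X: n x n_s matrix of class samples
        m = X.mean(axis=1, keepdims=True)     # sample mean of the class
        Z = X - m                             # sample offsets z_j = x_j - m
        return Z @ Z.T                        # S = Z * Z^T, an n x n matrix

    rng = np.random.default_rng(3)
    n = 4
    Xs = rng.random((n, 30))                  # 30 training samples of class s
    Xt = rng.random((n, 25)) + 0.5            # 25 training samples of class t

    C = 0.5                                   # fraction from the Fisher criterion
    A = C * scatter_matrix(Xs) + (1 - C) * scatter_matrix(Xt)     # Eq. 10.19
    print(A.shape)                            # (4, 4)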
Let m = m_s - m_t be the mean difference. We define a k x k matrix B_k = (b_ij) for k = 1, 2, ..., n - 1, where b_ij = d_i^T · A^(-1) · d_j for 1 <= i, j <= k. The Foley-Sammon algorithm is summarized below:
Figure 10.48 The schematic design of a VLSI feature extractor; training samples pass through a vector subtractor, matrix multiply networks, and a weighted adder to form the scatter matrices, which feed an L-U decomposition network, matrix and triangular matrix inverters, and a feature generator. (Courtesy of Computer Vision, Graphics, and Image Processing, Hwang and Su, November 1983.)
VLSI pattern classification Linear discriminant functions can be used for pattern classification. To distinguish among p pattern classes, p(p - 1)/2 pairwise discriminant functions are needed. A linear discriminant function between two distinct classes is defined by:

    F(y) = v^T · y + α                                               (10.22)

where y = (y_1, y_2, ..., y_m)^T is the feature vector, v = (v_1, v_2, ..., v_m)^T is called the discriminant vector, and α is a scalar threshold constant. The discriminant vector v and threshold constant α are determined with the aid of a set E of training features with known class labels. The feature pattern y is classified into class s if F(y) >= 0, and into class t otherwise.
Fisher's method is used to generate the discriminant vector v from training features. The threshold constant α can be set to zero with an appropriate choice of the coordinate system. Let y_j^(s) be the jth training feature vector of class s, for j = 1, 2, ..., m_s. The feature mean difference is f = f_s - f_t, and the feature offset vector is w_j^(s) = y_j^(s) - f_s. An m x m_s feature offset matrix W_s = [w_1^(s), w_2^(s), ..., w_{m_s}^(s)] is defined for each class. The covariance matrix for class s is computed by:

    Σ_s = (1/(m_s - 1)) · W_s · W_s^T                                (10.23)

where Σ_s has dimension m x m. The following computation steps are needed to generate the discriminant vectors needed for pattern classification.
A linear pattern classification algorithm

1. Compute f = f_s - f_t and the feature offset matrices W_s by subtracting the mean f_s from each training feature y_j^(s) for j = 1, 2, ..., m_s.
2. Generate the matrices Σ_s and Σ_t using Eq. 10.23.
3. Solve the following linear system of equations to determine the discriminant vector v:

    (Σ_s + Σ_t) · v = f                                              (10.24)
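The three steps above reduce to a few lines of numpy. The sketch below runs on synthetic data (the data, the random seed, and the midpoint centering used to make the threshold zero are our assumptions); the discriminant vector is obtained by solving Eq. 10.24, and a pattern is assigned to class s when the centered discriminant is nonnegative.

    # Fisher linear discriminant following Eqs. 10.23-10.24 (sketch).
    import numpy as np

    rng = np.random.default_rng(4)
    m = 3
    Ys = rng.normal(loc=+1.0, size=(m, 40))       # training features of class s
    Yt = rng.normal(loc=-1.0, size=(m, 40))       # training features of class t

    fs, ft = Ys.mean(axis=1), Yt.mean(axis=1)
    f = fs - ft                                   # feature mean difference
    Ws, Wt = Ys - fs[:, None], Yt - ft[:, None]   # feature offset matrices
    cov = lambda W: W @ W.T / (W.shape[1] - 1)    # Eq. 10.23
    v = np.linalg.solve(cov(Ws) + cov(Wt), f)     # Eq. 10.24: (S_s + S_t) v = f

    midpoint = (fs + ft) / 2                      # centering makes the threshold zero
    classify = lambda y: "s" if v @ (y - midpoint) >= 0 else "t"
    print(classify(fs), classify(ft))             # prints: s t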
The functional design of a VLSI pattern classifier based on the above algorithm is sketched in Figure 10.49. The schematic design of the covariance matrix generator is similar to the scatter matrix generator in Figure 10.48. The linear system solver is composed of an L-U decomposition network (Figure 10.43) and a triangular system solver (Figure 10.46). This matrix solver is needed to solve the dense system specified in Eq. 10.24. The Fisher classifier is essentially a threshold decision unit (Eq. 10.22) which can be easily implemented with some combinational logic circuits and the modified V modules. The VLSI approaches to both feature extraction and pattern classification can result in significant speedups over the corresponding uniprocessor implementations. This will make it possible to achieve real-time image processing.
Feature extraction and pattern classification are initial candidates for possible VLSI implementation. VLSI computing structures have also been proposed for a variety of other image processing operations.
Figure 10.49 The schematic design of a VLSI pattern classifier; its outputs are the class labels. (Courtesy of Computer Vision, Graphics, and Image Processing, Hwang and Su, November 1983.)
The Manchester machine is reported in Gurd and Watson (1980). The MIT dynamic data flow project has been described in Arvind et al. (1980) and Arvind (1983). The EDDY system is described in Takahashi and Amamiya (1983). The French LAU system is described in Syre et al. (1977, 1980). The Utah machine is described in Davis (1978). The Newcastle control data flow machine is described in Treleaven (1978). The dependence-driven approach was proposed by Gajski et al. (1981). The event-driven approach using priority queues was suggested by Hwang and Su (1983c). Packet switching networks for dataflow multiprocessors were treated in Dias and Jump (1981) and in Chin and Hwang (1983).

The systolic array for VLSI computation was suggested by Kung and Leiserson (1978). A review of the systolic architecture can be found in Kung (1982). An assessment of VLSI for highly parallel computing is given by Fairbairn (1982). The material on mapping algorithms into VLSI arrays is based on the work by Moldovan (1983). Reconfigurable processor-switch lattices are based on the work of Snyder (1982). Wafer scale integration of the switch lattice is studied in Hedlund's thesis (1982). Partitioned matrix algorithms and VLSI image processing structures are based on the work by Hwang and Cheng (1982) and by Hwang and Su (1983a). A wavefront approach to designing cellular processor arrays can be found in S. Y. Kung et al. (1982). The 3-dimensional VLSI architecture was treated in Grinberg et al. (1984).
Problems
10.1 Describe the following terms associated with data flow computers and languages:
(a) Control flow computers
(b) Data-driven computations
(c) Static data flow computers
(d) Dynamic data flow computers
(e) Data flow graphs and languages
(f) Single-assignment rule
(g) Unfolding of iterative computations
(h) Freedom from side effects
(i) Dependence-driven computation
(j) Coloring technique
(k) Event-driven computation
10.2 Draw data flow graphs to represent the following computations:
10.3 It is desired to construct a packet-switched arbitration network for a static data flow computer similar to the Dennis machine at MIT. Use 5 x 2 switch boxes as building blocks to construct the network. A unique addressing path is demanded in the network.
(a) Design the 5 x 2 switch box with multiplexers, demultiplexers, and buffers. The switch control mechanism should be explained.
(b) Construct a 125 x 8 buffered Delta network with the 5 x 2 switch boxes to be used for arbitration purposes. Show all the interconnections between stages.
(c) There are 243 possible states of a typical 5 x 2 switch box. Each input port sends one request to be connected to one of the two output ports, or none. Assuming that all states are equally probable, derive the blocking probability of a typical 5 x 2 switch box. When the switch is in a nonblocking state, all requests from input ports are connected to distinct output ports without conflicts. Whenever two or more input requests are destined to the same output port, the switch enters a blocking state. The blocking probability indicates the chance that a switch may be blocked.
(d) Repeat the same question in part (c) for a 2 x 2 switch box. Compare the blocking probabilities of the 5 x 2 and 2 x 2 switch boxes. Which one has the higher blocking probability?
(e) Based on the blocking probability found in part (c), derive an expression to represent the network blocking probability, if requests from all 125 input ports are equally probable and their destination distribution is uniform among the eight output ports.
10.4 (a) Use 8 x 8 switch boxes to construct a 64 x 64 routing network for the Arvind machine with 64 PEs. Label all the input-output ports and show all the interconnections among the 8 x 8 boxes.
(b) Show how to join two 64 x 64 networks to form a network of size 112 x 112. Some outputs of one network can be connected to some inputs of the other network in the joining process.
(c) Suppose that each 8 x 8 switch box has a delay of D. Analyze the delays of the 64 x 64 network and of the 112 x 112 network separately, under no-blocking conditions.
10.5 Given a sequence of weights {w_1, w_2, ..., w_k} and an input sequence of signals {x_1, x_2, ..., x_n}, design two linear systolic arrays with k processing cells to solve the convolution problem.
(a) In the first design, you are given the unidirectional cells which compute y_out ← y_in + w · x_in, as shown in Figure 10.50a. Explain your design, the distribution of the inputs, and the systolic flow of the partial results y_i from left to right.
(b) In the second design, you are given the bidirectional cells which compute y_out ← y_in + w · x_in and x_out ← x_in, as shown in Figure 10.50b. Explain the design and operation of this systolic convolution array.
(Figure 10.50: (a) a unidirectional cell computing y_out ← y_in + w · x_in; (b) a bidirectional cell computing y_out ← y_in + w · x_in and x_out ← x_in.)
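For reference, the quantity the two arrays must produce is the convolution of the weight and signal sequences. The plain software model below is our own sketch and assumes the usual definition y_i = w_1·x_i + w_2·x_{i+1} + ... + w_k·x_{i+k-1}.

    # The convolution each systolic design must produce (assumed definition).
    def convolution(w, x):
        k = len(w)
        return [sum(w[j] * x[i + j] for j in range(k))
                for i in range(len(x) - k + 1)]

    print(convolution([1, 2, 3], [1, 1, 1, 1, 1]))   # [6, 6, 6]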
Let r_0, r_1, ..., r_n be the dimensions of the n matrices, with r_{i-1} and r_i the dimensions of M_i. Denote by m_ij the minimum cost of computing the product M_i · M_{i+1} ··· M_j. The algorithm which produces m_ij is given below:

Following the mapping procedure in Section 10.3.2, transform the above algorithm into a suitable form which can be implemented by the triangular array shown in Figure 10.50. All the processing cells perform the same functions, to be defined in your transformed algorithm.
[Hint: Consider a transformation of the indices T: (k, i, j) → (k', i', j').]
10.8 Develop a cellular array processor for implementing a tridiagonal linear system solver. Specify the cell functions, the VLSI array structure, and the data flow patterns in the cellular array. You have freedom in choosing either a global systolic approach or a modular pipelined approach based on block partitioning and back substitution. Comment on the speed and hardware complexities of your design.
10.9 Consider the program graph shown in Figure 10.51. The critical path is a_1, b_1, c_1, c_2, ..., c_8, which results in a lower bound on execution time of 13 time units, assuming that a division takes 3 time units, a multiplication 2, and an addition 1. A hypothetical data flow computer has four processing units, each capable of executing any function. We idealize the machine by assuming that memory and interconnection delays are zero. For each of the following machine organizations, show the schedule (a time-space diagram similar to that in Figure 3.46) of the 24 computing events (8 divisions, 8 multiplications, and 8 additions) and indicate the total execution time and the utilization rate of the processors.
(a) Use only one processor to perform sequential execution, one operation at a time.
(b) Use three processors to perform static data flow computations with a one-token-per-arc policy. Note that the three processors can be pipelined to execute a block of statements inside the loop.
(c) Use four processors to perform dynamic data flow computations such that tokens are labeled and logical events are colored to share the available resources.
    input d, e, f
    c_0 = 0
    for i from 1 to 8 do
        begin
            a_i = d_i / e_i
            b_i = a_i * f_i
            c_i = b_i + c_{i-1}
        end
    output a, b, c
Figure 10.51 A program graph for Prob. 10.9. (Courtesy of IEEE Computer, Gajski et al., Feb. 1982.)
BIBLIOGRAPHY
Ackerman, W. B., and Dennis, J. B., “VAL—A Value-Oriented Algorithmic Language,” FR-2/8,
Lab, for Computer Science, MIT, June 1979,
Agerwala. T.. Some Extended Semaphore Primitives.” Acta informatica 8 Springer Verlag, 1977.
Agrawal, D. P.. ‘Graph Theoretical Analysis and Design of Multistage [ntercannectian Networks,”
{EEE Trans. on Comp., C-32, July 1983, pp. 637-648.
Algiere, J. L., and Hwang, K.. "Sparse Matrix Techniques for Circuit Analysis on the Cyber 205,”
TR-EE 83-40, Purdue University, W. Lafayette, Indiana, October 1983.
Amdahl Corp., Amdah! 470 V6 Muchine Reference Manual, Sunnyvate, California, 1975,
Amdahl, G. M., Blaauw, G. A., and Brooks, F. P., Jr., "Architecture of the IBM System/360," IBM Journ. of Res. and Dev., vol. 8, no. 2, 1964, pp. 87-101.
Auderson, D. W., Earle, J, G.. Goldschmidt, R. E., and Powers, D. M., “The IBM System/360 Model
91: Floating-Point Execution Unit.” JBM Journ. of Res. and Dev., January 1967a, pp. 34-53,
Anderson, D. W., Sparacio, F. A., and Tomasulo, R. M., “ The IBM System/360 Model 91: Machine
Philosophy and Instruction Handling,” LBM Journ. of Res. and Dev. val, 11, no, 1, 19672, pp, 8-24,
Anderson, J. P., Hoffman, $. 4., Shifman, J., and Williams, R. J.,“*D825-4 Multiple Computer System
for Command and Control.” Proc. AFIPS Fail Joint Computer Conference, vol. 22, 1962, pp.
86-96,
Andrews, G. J.,and McGraw,J. R.,° Language Features for Parallel Processing and Resource Control,”
Proceedings of the Canference on Design and implementation of Programming Languages, lihaca,
N.Y., October 1976.
Andrews. G. R., and Schneider, F. B., “Concepts and Notations for Concurrent Programming,”
ACM Computing Surreys, vol, 15, March 1983, pp. 3-43.
Arnold, C. N.,‘* Performance Evalution of Three Automatic Vectorizer Packages,” Proc, of Invl, Conf.
Parallel Proc., 1982, pp. 238-242.
Arnold, J. S., Casey, D, P., and McKinstry, R. H.. "Design of Tightly-Coupled Multiprocessing
Programming,” (8M Systems Journal, no. 1, 1974.
Arvind and Gostelow, K. P., "A Computer Capable of Exchanging Processors for Time," Proc. 1977 IFIP Congress, North-Holland, Amsterdam, 1977.
Arvind and Gostelow, K. P., "The U-Interpreter," IEEE Comp., vol. 15, no. 2, Feb. 1982, pp. 42-50.
Arvind and Iannucci, R. A., "A Critique of Multiprocessing von Neumann Style," Proc. 10th Ann. Symp. Computer Architecture, June 1983, pp. 426-436.
Arvind, Kathail, V., and Pingali, K., "A Data Flow Architecture with Tagged Tokens," Technical Memo 174, Lab. for Computer Science, MIT, Sept. 1980.
Association of Computing Machinery, “Special Issue on Computer Architecture,” Comm. of ACM,
vol. 2], no. 1, Jan. 1980.
Backus, J., "Can Programming Be Liberated from the von Neumann Style? A Functional Style and Its
Algebra of Programs,” Comm. of ACM, vol. 21, no. 8, Aug. 1978, pp. 613-641,
Baer, J. L., “A Survey of Some Theorelical Aspects of Multiprocessing,” ACAY Computing Surreys,
vol. 5, ne. 1, March 1973, pp. 31-80.
Baer, J. L., "* Multiprocessing Systems,” JEEE Trans. on Comp., C-25, Dec. 1976, pp. 1271-1277,
Baer. J. L.. Computer Svstems Architectures, Computer Science Press. Potomac, Maryland, £980.
Baer, J. L., and Bovet. D. P., “Compilation of Aritametic Expressions for Paralle] Computations,”
Proc, IFIP Congress 1968, North-Holland, Amsterdam, 1968; pp. 340-346.
Baer, J. L.. and Ellis, C., “Model, Design, and Evaluation of a Compiler for a Parallel Processing
Environment,” /EEE Trans. on Soft. Fag., SE-3, Nov. 1977, pp. 394-405,
Bain, W. L., Jr. and Ahuja, 8. R., * Performance Analysis of High-Speed Digital Buses for Mult-
processing Systems," Prec. &th dna, Symp. Cemputer Architecture, May 1981, pp. 107-131,
Banerjee, U., Gajski, D,. and Kuck, D., “ Accessing Sparse Arrays in Parallel Memories.” Journ. of
VLSI and Computer Systems, val. |, 80, |, 1983, pp, 69-99,
Barbe, D. F., ed., Very Large Scale Integration (VLSI): Fundamentals and Applications, Springer-Verlag, New York, 1980.
Barnes, G. H., Brown, R. M., Kato, M., Kuck, D. J., Slotnick, D. L., and Stokes, R. A., "The ILLIAC IV Computer," IEEE Trans. on Computers, Aug. 1968, pp. 746-757.
Baskett, F.. and Keller, T. W., An Evaluation of the CRA Y-] Computer,” High Speed Computer and
Algorithm Organization, Kuck, et al., eds.. Academic Press, New York, 1977, pp. 71-84.
Baskett, F., and Smith, A. J., ‘Interference in Multiprocessor Computer Systems with Interleaved
Memory,” Comm. of ACM, vol. 19, ne, 6 June 1976, pp. 327-334,
Batcher, K. E., “STARAN Parallel Processor System Hardware,” Proc. AFIPS-NCC, vol. 43.
pp, 405-410.
Batcher, K. E., * The Flip Network in SFARAN,” iri. Conf. Parallel Proc., Aug. 1976, pp. 65+71.
Batcher, K. E., "The Multi-dimensional Access Memory in STARAN,” JEEE Frans. on Conaps., 1977,
pp. 174-177.
Batcher, K. E., " Design ofa Massively Parallel Processor.” JEEE Trans. on Consp., C-29, Sept, 1980,
pp. 836-840,
Baudet, G, M., * Asynchronous Iterative Methods for Multiprocessors,” Journ. of ACM, val. 25, no, 2,
Apr. 1978, pp. 226-244,
Baver. L. H.. “Implementation of Data Manipulating Funclions on the STARAN Associative Array
Processor,” Prac. Sagamore Comp. Conf Parallel Proc, Aug. 1974, pp. 209-227,
Bell, C.G., Mudge, }. C.,and McNamara, J., Computer Engineering: A DEC View of Hardware Systems
Design, Digital Press, Bedford, Mass., £978.
Bell, J.. Casasent, D., and Bell, C. G., “An Investigation of Alternative Cache Organizations,” JEEE
Trans, on Comp., C-23, Apr. 1974, pp. 346-351,
Benes, Vo EL. Mathematical Theory of Connecting Nepvorks and Telephone Traffic, Academic Press,
New York, 1965.
Bensoussan, A., Clingen. C. T., and Daley, R. C.. °° Fhe MULTICS Virtual Memery: Concepts and
Design,” Comm. of ACM, vol. 15, May 1972, pp. 308-315.
Bernstein, A. J., “Analysis of Programs for Parallel Processing,” (EEE Trans, Elec. Comp., £-15,
Oct. 1966, pp. 746-757,
Berra, P. B,, and Oliver, E,, “ The Role of Associative Array Processors in Database Machine Archi-
tecture,” JEEE Comp., Mar, 1979, pp. 53-61.
Bhandarkar. D. P., ** Analysis of Memory Imterference in Multiprocessors,” [EEE Trans. on Comp.,
C-24, Sept. 1975, pp. 897-908.
Bhandarkar, D. P.. “Some Performance Issues in Multiprocessor System Design,” JEEE Trans. on
Comp., C-26, no. 5, May 1977. pp. 506-511.
Blaauw, G., “Computer Architecture,” Elecironische Rechenanlagen, vol. 14, no, 4, 1972, pp. 154-
160,
Bluauw, G., Digited Systems Implementation, Prentice-Hall, Englewood Cliffs, N.J., 1976.
Bode, A., and Handler, W., Rechnerarchiiekture: Grundlagen und Verfahren, Springer-Verlag,, Berlin
{volumes f and 2), 1980, 1982.
Borgeson, B. R., Hanson, M. L.. and Hartley. P. A.. “The Evolution of the Sperry Univac £100 Series:
A History, Analysis, and Projection.” Comm. of ACM, vol. 21, no. 1, Jan. 1978, pp. 25-43.
Bouknight. W. J., Denenberg, 5. A., McIntyre, D. E., Randall, J.M., Sameh, A. H., and Siotnick, D.L,,
“The IHiac 1V System.” Proc. JEEE, vol. 60. no, 4, Apr. 1972, pp. 369-388.
Bovet, D. P., and Vanneschi. M., **Models and Evaluation of Pipeline Systems,” Computer Archi-
rectures and Neoverks (Gelenbe and Mahi. eds.), North Holland, Amsterdam, 1976, pp. 99-111.
Brent, R.. “The Paratlel Evaluation of General Arithmetic Expressions,” Journ, of ACM, val. 21,
no. 2, Apr. 1974, pp. 201-206,
Briggs, F. A., “Performance of Memory Configurations tor Paraliel-Pipelined Computers,” Proe. 5th
Aan. Spmp. Computer Architecture, Apr. 1978, pp. 202-209,
Briggs. F. A., “Effects of Buffered Memory Requests in Multiprocessor Systems,” Prec.
ACM/SIGMETRICS Conf. on Simulation, Measurement and Modeling of Comput. Systems,
1979, pp. 73-81,
Briggs, F. A., and Davidson, E. 8., ** Organization of Semiconductor Memories for Parallel-Pipetined
Processors, JEEE Trans. on Comp., Feb. 1977, pp. 162-169,
Briggs. F. A., and Dubois, M,, * Modeling of Synchronized [terative Algorithms for Multiprocessors,”
Proc. 18th Ann. Allerton Conf, on Communication, Control and Computing, Oct, 1980, pp, 554-563.
Briggs, F. A., and Dubois, M., “Performance of Cache-Based Multiprocessors,” Proc, ACM/SIG-
METRICS Conf. on Measurement and Modeling of Computer Systems, Sept. 1981,
Briges, F. A., and Dubois, M., ‘Effectiveness of Private Caches in Multiprocessor Systems with
Parallel-Pipelined Memories,” /EEE Trans. on Comp. C-32, no. 1. Jan. 1983, pp. 48-59,
Briggs, F. A., Dubois, M., and Hwang, K., “Throughput Analysis and Configuration Design of a
Shared-Resource Multiprocessors Systems: PUMPS,” Proc, 8th nn. Symp. Computer Archi-
tecture, May 1981, pp, 67-80,
Briggs, F. A., Fu, K. §., Hwang. K,. and Patel, J. H.. ““PM*: A Reconfigurable Multiprocessor System
for Pattern Recognition and Image Processing,” Proc, af NCC, AFIPS, June 1979, pp, 255-265.
Briggs, F. A., Fu, K.S., Hwang, K., and Wah, B. W., °° PUMPS Architecture for Pattern Analysis and
Image Database Management,” /EEE Trans. on Cony, C-31, no. 10, Oct, 1982, pp, 969-982,
Browning, 8. A., “The Tree Machine: A Highly Concurrent Computing Environment,” Ph.D. Thesis,
Dept. of Computer Science, Cal. Tech,, Pasadena, 1980.
Bucher, [. Y., “The Computational Speed of Supercomputers,” Proc. ACM; SIGMETRICS Conf. on
Measurement and Modeling of Computer Systems, Aug, 1983, pp. 151-165,
Bucholz, W., Planning a Computer System: Project Stretch, McGraw-Hill, New York, 1962,
Budnik, P. P., and Kuck, D. J. “The Organization and Use of Parallel Memories,” /EEE Trans. on
Comp., vol. 20, Dec. 1971, pp. 1566-1569. ,
Burnett, G. J., and Coffman, E. G., “A Study of Interleaved Memory Systems,” Prac, Spring Joint
Computer Conf., AFIPS, SICC, vol, 36, 1970, pp. 467-474,
Burroughs Co.. “BSP: Gverview Perspective, and Architecture,” document no, 61391, Feb. 1978a
(30 pages).
Burroughs Co., “BSP; Fioating Point Arithmetic,” document na. 61416, [9784 (27 pages).
Burroughs Co., “BSP: [Implementation of FORTRAN,” document no. 16391, Nov. 1977 (18 pages).
Buzen, J, P., 1/0 Subsystem Architecture,” Proc, (EEE, 63, June 1975, pp. 871-879,
Cappa, M., and Hamacher, V. C., “An Augmented Herative Array for High-Speed Array Division,”
IEEE Trans. on Comp., C-22, Feb. 1973, pp. 172-175.
Carlson, W. W., and Hwang, K.. "On Structural Data Accessing in Dataflow Computers,” Proc. fst
Intl, Conf. Computers and Applications, Beijing, China, June 1984.
Case, R. P., and Padegs, A., “ Architecture of the IBM System 370." Comm. of ACM. vol. 21, no. f,
1978, pp. 73-96.
Censier, L. M., and Feautrier, P.. **A New Solution to Coherence Problems in Muiticache Systems,”
IEEE Trans, on Comp,, C-27, Dec. 1978, pp. FE L2-1118.
Chamberlin, D. D., “*Paraile! [mptementation of a Single Assignment Language,” PA.D. Thesis,
Stanford Univ., Cahf.. 1976.
Chamberlin, D. D., Fuller, SH. and Liu, L. ¥., ‘An Analysis of Page Allocation Strategies for
Multiprogramming Systems with Virtual Memory,” [BM Journ, of Res. and Dev, 973.
Chandy, K. M,, “*Madels for the Recognition and Scheduling of Parallel Tasks on Multiprocessor
Systems,” Bulletin of the Operations Research Society of America, vot, 23, suppl. |, Spring 1975,
p. B-Et7.
Chang, D., Kuck, D. J.. and Lawrie, D. H., ‘On the Effective Bandwidth of Paraitet Memories,”
IEEE Trans, on Comp., C-26, no. 5, May 1977, pp. 480-490,
Charlesworth, A, E., “An Approach to Scientific Array Processing: The Architecture Design of the
AP-120B/EPS-164 Family," LEEE Comp., Dec. 1981, pp. 12-30,
Chen, N. F., and Liu. C. L., “On a Class of Scheduling Algorithms for Multiprocessor Computing
Systems,” Proc, Conf. on Parallel Proc., Raquetie Lake, N-Y.. August 1974,
Chen, 8. C., “Speedup of Iterative Programs in Multiprocessor Systems,” PAD. Thesis, Univ, of TE,
ai Urb..-Champ., Dept. of Computer Science, no. 75-694, Jan. 1975,
Chen. S. C., “Large-Scale and High-Speed Multiprocessor System for Scientific Applications: Cray
X-MP Series.“ Proc. NATO Advanced Research Workshop on High-Speed Computing. J. Kawalik,
editor, Springer Verlag, Jiilich, W. Germany, June 20-22, 1983.
Chen, T. C.. ‘Pavalielism, Pipelining, and Computer Efficiency,” Computer Design, Jan. 1971, pp.
69-74,
Chen, T. C., “Overlap and Pipeline Processing,” in Jairoduction to Computer Architecture, Chap. 9
(Stone, ed.), Science Research Associates, Inc., Chicago, 1980, pp, 427-486,
Chin, C. Y., and Hwang, K., "Packet Switching Networks for Multiprocessors and Dataflow Computers," IEEE Trans. Computers, Nov. 1984, pp. 991-1003.
Chow, C. K., On Optimization of Storage Hierarchies,” (BM Journ. af Res, und Dev. May 1974,
pp. 194-203,
Chu, ¥., High Level Language Computer Architecture, Academic Press, New York, 1975.
Chu, Y.,and Abrams, M., ‘ Programming Languages and Direct-Execution Computer Architecture,”
IEEE Comp., vol, 14, July 1981, pp. 22-40,
Coffman, E. G., “Bounds on Parallel Processing of Queues with Mullipte Jobs,” Nasal Research
Logical Quarterly, 14, Sept. 1967, pp. 345-366.
Coffman, E. G., Elphick, M. J,, and Shoshani, A, °*System Deadlocks.” ACM Computer Surreys, 3,
1971, pp. 67-78,
Coffman, E.G., and Denning, P. J., Operating Systems Theory, Prentice-Hall, Englewood Cliffs, NJ.,
1973.
Coffman, E. G., and Graham, R. L., “ Optimal Scheduling for Two Processor Systems,” Acta fifor-
matica, val, 1, 1972, pp. 200-213,
Coffman, E. G., and Ryan, T. J., Jr. A Study of Storage Partitioning Using a Mathematical Model
of Locality,” Comm, af ACM, vol. 15, Mar, 1972, pp. 185-190.
Coffman, FE. G., ed., Computer and Job-Shop Scheduling Theory, John Wiley, New York, 1976.
Cohen, E., and Jefferson, D., “ Protection in the Hydra Operating System,” Proc. 344 Symp, on Operat-
ing System Principles, Nov, 1978, pp. 141-160,
Cohen, T., “Structured Flowcharts for Multiprocessing,” Computer Languages, vol, 13, no, 4, 1978,
pp. 209-226,
Connors, W. D., Florkowski, J. H., and Patton, $. K., “The IBM 3033: An Inside Look,” Datamation,
May 1979, pp. 198-218.
Conrad, ¥., and Wallach, "Iterative Solution of Linear Equations on a Parallel Processor System,”
IEEE Trans on Comp., Sept. 1977, pp. 838-847,
Conti. C. J, “Concepts for Buffer Storage,” Campurer Group News, 2, Mar, 1969, pp, 9-13.
Conti, C. J., Gibson, D. H., and Pikowsky, S. H.. “Structural Aspects of the System 360/85. General
Organization,” BMF Systems Journ. 1968, pp. 2~14.
Control Data Corp., Contre’ Data STAR-100 Features Manual, St. Paul, Minn,, pub, no. 60425500,
Oct. 1973.
Contro! Data Corp., Control Data STAR-100) FORTRAN Language Version 2 Reference Manual, St.
Paul, Minn,, pub. no. 60386200, 1976.
Control Data Corp,, CAC Cyber 260 Operating System 14 Reference Manual, St. Paul, Minn, pub.
no, 60457000, vol, 1, 1979a.
Control Data Corp., CDC Crher 200 Fortran Language 1.4 Reference Manual, $1, Paul, Minn., pub,
no. 60456040, 1979p.
Control Data Corp., CDC Crher 200; Mode/ 205 Technical Description, St. Paut, Minn., Nov. 1980.
Conway, M..° A Multiprocessor System Design,” Proc, AFIPS Fall Joint Comput. Conf. Spartan
Books, N.Y., 1963, pp. 139-146. ‘
Cooper, R.G., The Distributed Pipeline,” (EEE Trans. Comp. Nov. 1977, pp. 1123-1132,
Cordennier, V,, “A Two Dimension Pipelined Processor for Communication in a Parallel System,”
Proc, 1975 Sagamore Coip. Conf. Paralled Proc. 1975. pp. 115-122
Crane. B.A. Gilmartin, M. J. Huttenhoff. J. H., Rus. P. T., and Shively, R. Ro. “PEPE Computer
Architecture,” JEEE Conpcon, 1972, pp. 37-60.
Cray Research, Inc.. CRA ¥-} Computer System Hardware Reference Manuci, Bloomingion, Minn.,
pub. no, 2240004, 1977,
Cray Research, Inc.. CRA Y-1 Computer System Preliminary CRAY FORTRAN (CFT) Reference
Manual, Bloomington, Minn,, pub. na, 2240009, 1978.
Cray Research Inc.. CRA ¥-1 Fortran (CFT) Reference Manual, Bloomington, Minn., pub. no. 2240009,
Dec. 19794
Gustafson, R. N,, and Sparacio, F. J.. ‘* [BM 308! Processor Unit: Design Considerations and Design
Process.” {BM Journ, of Res. and Dev., vol. 26, no. 1, Jan. 1982, pp. 12-24.
Daley. R.C., and Dennis, J. B.,* Virtual Memory Process and Sharing in Mulues.”” Conn, of ACM,
vol, 11, May 1968, pp. 306-3 #1,
Datawest Corp.. “Real Time Series of Microprogrammable Array Transform Processors,” Prod.
Bulletin Series B, 1979.
Davidson. E. $., °The Design and Contro! of Pipelined Funetion Generators.” Proc, 197 fineh fEEE
Conf. on Systems, Networks, and Computers, Gaxtepee, Mexice, Jan. 1971, pp. 19-21.
Davidson, E. §., “Scheduling for Pipelined Processors,” Prac. *1h Hawaii Conf. on System Sciences,
1974, pp. 58-60.
Davidson, E. S,, Thomas, D. P.. Shar, L. E., and Patel, J. H.. * Effective Control for Pipelined Com-
puters.” COMPCON Proc., IEEE 75CH0920-9€C, 1975, pp. [8 1-184.
Davis, A. L., "The Architecture and System Methodology of DDM1: A Recursively Structured Data Driven Machine," Proc. 5th Ann. Symp. on Computer Architecture, 1978, pp. 210-215.
Davis, C. G., and Vouch, R. L.. “Ballistic Missile Defense: A Supercomputer Challenge.” /ERE
Comp. Nov. 1980, pp. 37-46.
Deitel, H. M., 4a introduction to Operating Systems, Addison-Wesley, Reading, Mass, 1984,
Deminet. J., “Experience with Multiprocessor Algorithms,” [EEE Frans. on Comp. C-31, Apr. 1982,
pp. 278-288.
Deneleor, Inc., Heterogeneous Element Processor: Principles of Operation, April 1981.
Denning, P. J.,°* The Working Set Model for Program Behavior,” Comm. of ACM, vol, 11, no, 5, 1968,
pp. 323-333.
Denning, P. J., “Operating Systems Principles for Data Flow Networks.” JEFE Comp., July 1978,
pp. 86-96.
Denning, P. J.. “ Working Sets Past and Present.” (EEE Trans. on Soft. Eng., SE-6, no. 1, Jan. 1980,
Denning, P. J.,** Virtual Memory,” 4CM Computing Surveys, 2, Sept. 1970, pp. 153-189.
Denning. P. J.,and Graham, G, S., “ Multiprogrammed Memory Management,” Proe. FEEE, vol. 63.
June 1975, pp. 924-939.
Denning, P.J., and Schwartz, 8. C.,’ Properties of the Working Set Model,” Conun, of ACM, vol. 15,
1972.
Dennis, J. B., "Data Flow Supercomputers," IEEE Comp., Nov. 1980, pp. 48-56.
Dennis, J. B., "First Version of a Data Flow Procedure Language," in Lecture Notes in Computer Science, 19, Springer-Verlag, Berlin, 1974, pp. 362-376.
Dennis, J. B., Leung, C. K., and Misunas, D. P., "A Highly Parallel Processor Using a Data Flow Machine Language," CSG Memo 134-1, Lab. for Computer Science, MIT, June 1979.
Dennis, J. B., and Misunas, D. P., "A Preliminary Architecture for a Basic Data Flow Processor," Proc. Second Ann. Symp. on Computer Architecture, IEEE, Jan. 1975, pp. 126-132.
Dennis. J., and Rong, G., “* Maximum Pipelining of Array Operations on Static Dataflow Machine,”
Prog. (983 Int'l, Conf. Parallel Proc., August 23-26, 1983.
Dennis, J. B., and Weng, K., “Application of Dataflow Computation to the Weather Problem,”
High Speed Computer and Algorithm Organization, Kuck, eal. eds., New York: Academic Press,
New York, 1977, pp, 143-157,
Despain, A. M., and Patterson, D. A., “'X-tree-—A Tree Structured Multiprocessor Computer Archi-
tecture,” Proc, Sth ann. Symp. an Computer Architecture, 1978, pp. 144-151,
Dias, D. M., and Jump, J. R., "Analysis and Simulation of Buffered Delta Networks," IEEE Trans. on Comp., C-30, Apr. 1981a, pp. 273-282.
Dias, D. M., and Jump, J. R., "Packet Switching Interconnection Networks for Modular Systems," IEEE Comp., Dec. 1981b, pp. 43-53.
Dijkstra, E, W.. “Solution of a Problem in Concurrent Programming,” Comm, af ACM, vol. 8,
Sept. 1968, pp. $69-570.
Diksira, E. W.. "Cooperating Sequential Processes,” Programming Languayes. F, Genuys, ed.
Academic Press, New York, 1968. pp. 43-112.
Dorr, F. W.. The Cray Lat Los Alamos,” Datamation, Oct. 1978, pp. 113-120,
Dowsing, R. D., “Processor Management in a Multiprocessor System.” Electrome Letters, vol. 12,
no. 24, Nov. 1976.
Dubois, M., “Analytical Methodologies for the Evaluation of Multiprocessing Structures,” PAD,
Thesis, Purdue Uniy,, [nd., 1982,
Dubois, M., and Briggs, F. A., “Effects of Cache Coherency in Multiprocessors,” [EEE Trans, on
Comp., C-3l, no, 11, Nov, 19824,
Dubois, M., and Briggs, F. A., °° Performance of Synchronized Iterative Processes in Multiprocessor
Systems,” [EEE Trans. on Saft. Eny., July 19825. pp. 419-43
Duff, M. J.B. ed., Computing Structures for Image Processing, Academic Press, London, 1983.
El-Ayat, K. A.. * The [ntel 8089: An Integrated 1/0 Processor,” /EER Comp. vol. 12, no. 6, June 1979,
pp. 67-78.
Emer, J, S., and Davidson, E, $., “Control Store Organization for Multiple Stream Pipelined Pro»
ceasors.” Proc, (978 tnel Conf. Parallel Proc, (978, pp. 43-48.
Enslow, P. H., “Multiprocessor Organization,” Computing Surrers, vol. 9, Mar. 1977, pp, 103-129.
Enslow, P. H., ed., Multiprocessors and Parallel Processing, Wiley-Interscience, New York, 1974.
Evans, D. J.,ed.. Paralle? Processing Systems, Cambridge Univ. Press, England, 1982.
Evensen, A. J., and Troy. J. L., “Introduction to the Architecture of a 288-Element PEPE,” Proc.
Sagumore Conf. Parallel Prac. 1973. pp. 162-169.
Fabry, R. 8., “Capability-based Addressing.” Conn, of ACM, vol, 17, July 1974. pp. 403-412.
Faggin, F.. "How VLSI Lmpacts Computer Architecture,” JEEE Spectrum, 15, May 1978, pp. 28-31.
Fairbairn, D. G., "VLSI: A New Frontier for System Designers," IEEE Comp., Jan. 1982, pp. 87-96.
Feierbach, G., and Stevenson, D. K..'*The Phoenix Array Processing System,” Phoenix Praject Memo.
7, NASA Ames Research Center, Mountain View, Calif, Nov, 1978,
Feldman, J. DB. and Fulmer, L. C., “RADCAP:- Operational Parallel Processing Facility,” Proc.
Navi Comp, Conf., AFUPS, 1974, pp. 7-15.
Peller, W., 4p Introduction to Probability Theary und jty Applications, vol. |, Wiley, New York, 1970,
Femt, T. ¥.. "Some Characteristics of Associative/Parallel Processing,” Proce. /972 Sagamore Comp.
Conf., Syracuse Univ.. 1972, pp. 5~16.
Feng, T. Y.. * Data Manipulation Functions in Parallel Processors and Their Implementations.”
FEEF Trans. on Camp., C-23, no. 3, Mar. 1974, pp, 309-318.
Feng. T. ¥.,ed..° Parallel Processors and Processing.” special issue, ACM Computing Survevs, vol. 9,
no, 1, Mar. (97 7a.
Feng, T. ¥., ° Parallel Processors and Processing,” Class Notes, Wayne State Univ., Detroit, Michigan
(umpublished), 19774.
Feng, T. ¥., “A Survey of Interconnection Networks,” EEE Cump., Dec. 1981, pp. 12-27.
Fennell, K. D., and Lesser. V. RR. Parallelism in Artificial Intelligence Problem Solving: A Case Study
of Hearsay If," /EEE Trans. on Comp., Mar. 1977, pp. 98-111.
Ferrari, D., “An Analytic Study of Memory Allocation in Multiprocessing System,” Computer
Architecture and Networks, Gelenbe and Mahl, eds., North-Holland, Amsterdam, 1974.
Ferrari. D., Gelenbe, E., and Mahl, R., “An Analytic Study of Memory Allocation in Multiprocessor
Systems,” Pree. Conf. on Computer Architecture and Networks, France, August 1974,
Floating Point System, Inc. AP-/208 Processor Hanébook, Portland, Oregon, pub, no. 7259-02,
May 1976.
Flynn, M, J., “Very High-Speed Computing Systems” Proc. EEE, vol. 54, 1966, pp. 1901-1909,
Flynn, M. J, "Some Computer Organization and Their Effectiveness.” (EEE Truss. on Comp., C-21,
no. 9 Sept. 1972, pp. 948-960,
Flynn, M. J., “The {nterpretive {nterface: Resources and Program Representation in Computer
Organization.” High Speed Computer and Alyorithm Organization, Kuck et al., Academic Press,
New York, 1977, pp. 41--69,
Flynn, M. J.. and Amdahl. G. M,, “Engineering Aspects of Large High Speed Computer Design,”
Proc. Symp. Microelectronics and Large Systems, Spartan Press, Washington, D.C.. 1965, pp.
77-95,
Flynn, M. J.. Podvin, A.. and Shmizu. K.. ‘A Multiple Instcuction Stream with Shared Resources,”
Parallel Pracessar Systems, Techrotugies, and Applications, Hobbs. ed. Spartan Books.
Washington, D.C.. 1976. pp. 251-286.
Fontao, R. G., “A Concurrent Algorithm for Avoiding Deadlocks in Multiprocess Multiple Re-
sonanee Systems," Prac, 3d Symp. Operating System Principles, Oct. 1971.
Faster, C, C. (1976). Content-Addressable Parallel Processors, Van Nostrand Reinhold Co., New York,
1976,
Frania, W. R.. and Houle, P. A., “Comments on Models of Multiprocessor Multi-Memory Bank
Computer Systems," Proc, 1974 Winter Simutition Conf., vol. 1, Washington, D.C, Jan. 1974.
Fritsch, G., Klemoeder, W., Linster, C, U..and Volkert, J., “EMSY 85: The Erlangen Multiprocessor
System for a Broad Spectrum of Applications,” Proce, 1983 neh Conf. on Parallel Proc., August
1983, pp. 325-330.
Fuller. 8, H., and Harbison, §. P., The Cunmp Multiprocessor, Technical Report, Carnegie-Mellon
Univ.. Computer Science Dept., 1978.
Fuller, 8. H.. Swan, R.,and Wulf. W. A.. The Instrumentation of C.mmp: A Multi-miniprocessor,”
IEEE Compcon, |973.
Gajski, D. D., "An Algorithm for Solving Linear Recurrence Systems on Parallel and Pipelined Machines," IEEE Trans. on Comp., Mar. 1981, pp. 190-205.
Gajski, D. D., Kuck, D. J., and Padua, D. A., "Dependence Driven Computation," Proc. COMPCON Spring, Feb. 1981, pp. 168-172.
Gajski, D., Kuck, D., Lawrie, D., and Sameh, A., "Cedar: A Large Scale Multiprocessor," Proc. 1983 Intl. Conf. on Parallel Proc., Aug. 1983, pp. 524-529.
Gajski, D. D., Padua, D. A., Kuck, D. J., and Kuhn, R. H., "A Second Opinion on Dataflow Machines and Languages," IEEE Comp., Feb. 1982, pp. 58-70.
Gajski, D. D., and Rubinfield, L. P., "Design of Arithmetic Elements for Burroughs Scientific Processor," Proc. 4th Symp. Computer Arithmetic, Oct. 1978, pp. 245-256.
Gao. Q. S.. and Zhang. X.. ‘Cellular Vector Computer of Vertical and Horizontal Processing with
Vertical Common Memory,” Journ, of Computers, no, 1. Jan, 1979, pp. 1-12 (in Chinese),
Gao, Q. S., and Zhang. X.. “Another Approach to Making Supercomputer by Micropracessors—
Cellular Vector Computer
of Vertical and Horizontal Processing with Virtual Common Memory,”
fat, Canf. Parallel Pree., Aug, 1988, pp. 163-164
Gaudet, G., and Stevenson, D..* Optimal Sorting Algorithms for Parallel Computers,” /EEE Trans, on
Comp., C-271, Jan. 1978, pp. 84-87,
Gecsei, J,,and Lukes, J. A., “A Model for the Evaluation of Storage Hierarchies,” /BAF Systems Journ.
no. 2. 1974, pp. 163-178.
Ginsberg, M.. “Some Numeral Effects of A FORTRAN VYectorizing Compiler on A Texas Instrnu-
ments Advanced Scientific Computer,” High Speed Computer and Algorithm Organization, Kuck,
etal.,eds.. Academic Press. New York, 1977. pp. 461-62,
Goke, R.. and Lipovski, G. J., “Banyan Networks, for Partitioning on Multiprocessor Systems,”
Prac. [si Ann. Symp, Computer Archiiecture, (973, pp. 21-30,
Gonzalez, M. J., “ Deterministic Processor Scheduling,’ Computing Surveys, vol. 9, na, 3, Sept. 1977,
pp. 173-204.
Gonzalez, M. J., and Ramamoorthy, C. V., “Recognition and Representation of Parallel Processable
Streams in Computer Programs,” in Parallel Processor Systems, Technologies and Applications,
Macmillan Lid., London, England, 1970.
Gonzalez, M. J., and Ramamoorthy, €. V., ‘Parallel Task Execution in a Decentralized System,”
FEEE Trans, on Comp, C-21, Dec. 1972, 1310-1322.
Gonzalez, T., and Sahni, S., * Preemptive Scheduling of Uniform Processor Systems,” Journ. of ACM,
vol, 25, no. 1. Jan. 1978, pp. 92-101. ‘
Goodman, J. R. “An Investigation of Multiprocessor Structures and Algorithms for Database
Management,” UCBIERL M8183, Dept. of EECS, Univ, of Calif., Berkeley. 1981.
Goodyear Aerospace Ca., “Massively Parallel Processor (MPP), Tech. Report GER-16684, July
1979.
Gosden, J. A., “Explicit Parallel Processing Description and Cantrol in Programs for Multi and Uni-
processor Computers,” 4 FIPS Fall Joint Comput. Conf. Spartan Books, N.Y., 1966, pp.651-660,
Gostelow, K. P., and Thomas, R. E.,** Performance ofa Simulatec Dataflow Computer,” [EEE Trans.
on Comp., Oct. 1980, pp. 905-919.
Goitheb, A., Grishman, R., Keuskal, C, P.. McAaliffe, K. P., Randolph, L., and Snir, M., "The NYU
Ultracomputer-Designing an MIMB Shared Memory Parallel Computer,” /EEE Trans. on
Comp, Feb, 1983, pp. 175-189,
Graham, G. S.. A Study of Program and Memory Policy Behavior, PAD, Thesis, Purdue Univ.,
Ind., 1976.
Graham, R. L., “Bounds on Multiprocessing Anomalies and Packing Algorithms,” Proc, AFIPS 1972
Spring Joint Comp, Conf., 40, AFEPS Press, Montvale, N.J., 1972, pp. 205-217.
Graham, W. R., "The Parallel and the Pipeline Computers,” Daramation, Apr. 1970, pp. 68-71.
Grimsdale, R. L., and Johnson, D. M., “A Modular Executive for Multiprocessor Systems,” Proc.
Conf. on Trends in Qne-Line Computer Control Systems, Sheffield, England, Apr. 1972.
Grinberg, J., Nudd, G. R., and Etchells, R. D., "A Cellular VLSI Architecture," IEEE Comp., Jan. 1984.
Grohoski, G. R., and Patel, J. H., “A Performance Model for Instruction Prefetch in Pipelined
Instruction Units,” Prec, /982 atl. Conf. Parallel Proc., August 24-27, 1982, pp. 248-252.
Gula, J. L., “Operating System Considerations for Multiprocessor Architecture,” Proc. 71h Texas
Conf. on Computing Systems, Houston, Nov, 1978.
Gurd, J., and Watson, I., "Data Driven System for High Speed Parallel Computing," Computer Design, Parts I & II, June & July 1980.
Habermann, A. N.,[Introduction to Gperating System Design, Science Res. Assoc., 1976,
Hallin, T. G,, and Fiynn, M. J., “Pipelining of Arithmetic Funetions,” JEEE Trans. on Comp,, Aug.
1972, pp. 880-886.
Handler, W., “The Impact of Classification Schemes on Computer Architecture,” Proc. [977 Ini.
Conf. on Parallel Proc., pp. 7-15.
Hansen, P. B., “The Programming Language Concurrent Pascal,” FEEE Trans. on Soft. Eng., 1, 2,
June 1975, pp, 199-207.
Hansen, P. B., Comcurrenr Pascal, Prentice-Hall. New York, 1978.
Hansen, P. B., The Architecture of Concurrent Programs, Prentice-Hall, Englewood Cliffs, N.J., (977.
Harris, J. A, and Smith, D, R., “Hierarchical Mulliprocessor Organizations,” Proc. 4s Symp. an
Computer Architecture, 1977. ,
Hayes, J. P., Computer Architecture and Organization, McGraw-Hill, New York, 1978.
Hedlund, K. S., "Wafer Scale Integration of Parallel Processors," Ph.D. Thesis, Comp. Science Dept., Purdue Univ., Ind., 1982.
Hellerman, H., Digital Computer System Principles, McGraw-Hill, New York, 1967, pp. 228-229,
Hellerman, H., and Smith, H. J., Jr, “Throughput Analysis of Some Idealized Input, Output, and
Compute Overlap Configurations,” Computing Surveys, 2, June 1970, pp. 111-118,
Higbie, L. C,, “ Applications of Vector Processing,” Computer Design, Apr. 1978, pp. 139-145.
Higbie, L. C.. Supercomputer Architecture,” JEEE Comp., 6, Dec. 1973, pp. 48-58.
Hintz, R. G., and Tate, D. P,, “Control Data STAR-100 Processor Design,’ COMfPCON Proc.,
Sept. 1972, pp. 1-4.
Hoare, ©, A. R., “Towards a Theory of Parallel Programming,” Operating Systems Techniques,
C.A.R. Hoare, ed., Academic Press, New York, 1972,
Hoare, C. A, R., “Monitors: An Operating System Siructuring Concept.” Comm. of ACM, vol. 17,
no, 10, Oct, 1974, pp. $49-$57,
Hockney, R. W., and Jesshope, C. R., Parallel Compuyrers: Archtrecture, Programming and Algorithms,
Adam Hilger Ltd., Bristol, England, 1981.
Holley, L. H.. Parmlee, R. P.. et al., “VM/370 Asymmetric Multiprocessing,” BM Sysrems Journ.
vol, 18, no. 1, 1979,
Holt, R. ©., “Some Deadlock Properties of Computer Systems,” 4CM Computing Surveys, 4, Sept.
1972, 179-198.
Hol, R. C., Graham, G. S., Lazowska, E. D., and Scott, M. A., Structured Concurrent Pragramming
with Operating Systents Applications, Addison-Wesley, Mass., 1978.
Hon, R., and Reddy, D. R., The Effect of Computer Architecture on Algortthm Decomposition and
Performance,” High-Speed Computers and Algorithm Organization, Kuck, et al, ed., Academic
Press, New York, 1977, pp. 411-421,
Hoogendoorn, C. H., “A General Model for Memory Interference in Multiprocessors,” /EEE Trans,
on Comp, C-26, no, 10, Oct, 19774, pp. 998-1005,
Hoogendoorn, C. H., “Reduction of Memory Interference in Multiprocessor Systems,” Proc. 4th
Ann. Symp. on Computer Architecture, Silver Springs, MD, Mar, 19776, pp, 179-183,
Hsiao, D. K., ‘Data Base Computers,” in Advances in Computers, vol. 19, Yovits, ed., Academic Press,
New York, 1980, pp. 1-64.
Hsiao, D.K.,ed., Advanced Database Machine Architecture, Prentice-Hall, Englewood Cliffs, N.J., 1983.
Hu, T. C., “Parallel Sequencing and Assembly Line Problems,” Oper. Res., vol. 9, no, 6, Noy.-Dec.,
1961, pp, 841-848.
Hufnagel, S., "Comparison of Selected Array Processor Architectures," Computer Design, Mar. 1979,
pp. 151-188.
Hwang, K., "Fault-Tolerant Microprogrammed Digital Controller Design," IEEE Trans. on Industrial
Electronics and Control Instrumentation, Aug. 1976, pp. 200-207.
Hwang, K., "Global and Modular Two's Complement Array Multipliers," IEEE Trans. on Comp.,
Apr. 1979a, pp. 300-306.
Hwang, K., Computer Arithmetic: Principles, Architecture and Design, Wiley, New York, 1979b.
Hwang, K., "VLSI Computer Arithmetic for Real-Time Image Processing," Chap. 7, VLSI Electronics:
Microstructure Science, vol. 7, Einspruch, ed., Academic Press, New York, 1984.
Hwang, K., and Chang, T. P., "Combinatorial Reliability Analysis of Multiprocessor Computers,"
IEEE Trans. Reliability, vol. R-31, no. 5, Dec. 1982, pp. 469-473.
Hwang, K., and Cheng, Y. H., "Partitioned Matrix Algorithms for VLSI Arithmetic Systems," IEEE
Trans. on Comp., C-31, no. 12, Dec. 1982, pp. 1215-1224.
Hwang, K., Chin, C. Y., and Ni, L. M., "Adaptive Path-Directed Routing for Packet Switched Com-
puter Networks," TR-EE 83-37, Purdue Univ., Ind., 1983.
Hwang, K., and Fu, K. S., "Integrated Computer Architectures for Image Processing and Database
Management," IEEE Comp., vol. 16, no. 1, Jan. 1983, pp. 51-61.
Hwang, K., ed., Tutorial on Supercomputers: Design and Applications, IEEE Computer Society Press,
Silver Spring, Maryland, August 1984.
Hwang, K., and Ni, L. M., "Resource Optimization of a Parallel Computer for Multiple Vector Pro-
cessing," IEEE Trans. on Comp., C-29, Sept. 1980, pp. 831-836.
Hwang, K., and Su, S. P., "VLSI Architectures of Feature Extraction and Pattern Classification,"
Computer Vision, Graphics, and Image Processing, vol. 24, Academic Press, New York, Nov.
1983a, pp. 215-228.
Hwang, K., and Su, S. P., "Priority Scheduling in Event-Driven Dataflow Computers," TR-EE 83-86,
Purdue Univ., Ind., Dec. 1983b.
Hwang, K., and Su, S. P., "Multitask Scheduling in Vector Supercomputers," TR-EE 83-52, Purdue
Univ., Ind., Dec. 1983c.
Hwang, K., Su, S. P., and Ni, L. M., "Vector Computer Architecture and Processing Techniques,"
Advances in Computers, vol. 20, Yovits, ed., Academic Press, New York, 1981, pp. 115-197.
Hwang, K., and Yao, S. B., "Optimal Batched Searching of Tree-Structured Files in Multiprocessor
Computer Systems," Journ. of Assoc. of Comp. Mach., vol. 24, no. 3, July 1977, pp. 441-454.
IBM Corp., IBM System/360 and System/370 I/O Interface Channel to Control Unit, form GA22-6974-3,
1976.
IBM Corp., IBM 3838 Array Processor Functional Characteristics, no. GA24-3639-0, file no. S370-08,
Endicott, N.Y., Oct. 1976.
IBM Corp., IBM System/370 Model 168 Functional Characteristics, form no. GA22-7010-4, 1976.
IBM Corp., 3033 Processor Complex Theory of Operation/Diagrams Manual, vols. 1-5, SY22-7001
through SY22-7005, Jan. 1978.
IBM Corp., "Special Issue on IBM 3081," IBM Journ. of Res. and Dev., vol. 26, no. 1, Jan. 1982,
pp. 2-29.
IBM Corp., System/370 Principles of Operation, GA22-7000-4, 1974.
Isloor, S. S., and Marsland, T. A., "The Deadlock Problem: An Overview," IEEE Computer, vol. 13,
no. 9, Sept. 1980.
Jain, N., Performance Study of Synchronization Mechanisms in a Multiprocessor, Ph.D. Thesis, Carnegie-
Mellon Univ., 1979.
Jensen, J. E., and Baer, J. L., "A Model of Interference in a Shared Resource Multiprocessor," Proc.
3d Ann. Symp. on Computer Architecture, Clearwater, Fla., Jan. 1976, pp. 52-57.
Jin, L., "A New General-Purpose Distributed Multiprocessor Structure," Proc. Intl. Conf. on Parallel
Proc., Aug. 1980, pp. 133-154.
Johnson, L., "Gaussian Elimination on Sparse Matrices and Concurrency," Tech. Report 4087:TR:80,
Dept. of Computer Science, Cal. Tech., Pasadena, 1980.
Jones, A. K., and Gehringer, E. F., eds., "Cm* Multiprocessor Project: A Research Review,"
Tech. Rept. CMU-CS-80-137, Carnegie-Mellon Univ., July 1980.
Jones, A. K., and Schwarz, P., "Experience Using Multiprocessor Systems: A Status Report," Dept.
of Computer Science, Carnegie-Mellon Univ., Tech. Report CMU-CS-79-146, Oct. 1979.
Jordan, H. F., "Performance Measurement of HEP - A Pipelined MIMD Computer," Proc. 10th Ann.
Symp. Computer Architecture, June 1983, pp. 207-212.
Jordan, H. F., Scalabrin, M., and Calvert, W., "A Comparison of Three Types of Multiprocessor
Algorithms," Proc. 1979 Intl. Conf. on Parallel Proc., Bellaire, MI, Aug. 1979, pp. 231-238.
Jump, J. R., and Ahuja, S. R., "Effective Pipelining of Digital Systems," IEEE Trans. on Comp.,
Sept. 1978, pp. 855-865.
Kaplan, K. R., and Winder, R. V., "Cache-Based Computer Systems," Computer, 6, Mar. 1973,
pp. 30-36.
Karp, R. M., and Miller, R. E., "Properties of a Model for Parallel Computations: Determinacy,
Termination, Queueing," SIAM Journal of Applied Mathematics, vol. 14, Nov. 1966, pp. 1390-1411.
Karplus, W. J., and Cohen, D., "Architectural and Software Issues in the Design and Application of
Peripheral Array Processors," IEEE Comp., Sept. 1981, pp. 11-17.
Kartashev, S. I., and Kartashev, S. P., "Problems of Designing Supersystems with Dynamic Architec-
tures," IEEE Trans. on Comp., Dec. 1980, pp. 1114-1132.
Kascic, M. J., Vector Processing on the Cyber 200, Control Data Corp., 1979 (38 pages).
Katzan, H., Computer Organization and the System/360, Van Nostrand Reinhold, New York, 1971.
Kaufman, M., "An Almost-Optimal Algorithm for the Assembly-Line Scheduling Problem," IEEE
Trans. on Comp., C-23, Nov. 1974, pp. 1169-1174.
Keller, R. M., "Look-Ahead Processors," ACM Computing Surveys, vol. 7, no. 4, Dec. 1975, pp.
177-195.
Keller, R. M., Patil, S. S., and Lindstrom, G., "A Loosely Coupled Applicative Multiprocessing
System," Proc. Natl. Computer Conf., AFIPS Press, 1979.
Kennedy, K., "Optimization of Vector Operations in an Extended Fortran Compiler," IBM Research
Report, RC-7784, 1979.
Kinney, L. L., and Arnold, R. G., "Analysis of a Multiprocessor System with a Shared Bus," Proc.
5th Ann. Symp. on Computer Architecture, Palo Alto, CA, Apr. 1978, pp. 89-95.
Kleinrock, L., Queueing Systems: Theory and Applications, Wiley, New York, 1975.
Knuth, D. E., and Rao, G. S., "Activity in Interleaved Memory," IEEE Trans. on Comp., C-24, no. 9,
Sept. 1975, pp. 943-944.
Kober, R., and Kuznia, C., "SMS - A Multiprocessor Architecture for High-Speed Numerical
Calculations," Proc. Intl. Conf. Parallel Proc., 1978, pp. 18-23.
Kogge, P. M., "The Microprogramming of Pipelined Processors," Proc. 4th Ann. Conf. Computer
Architecture, IEEE no. 77CH1182-5C, Mar. 1977a, pp. 63-69.
Kogge, P. M., "Algorithm Development for Pipelined Processors," Proc. 1977 Intl. Conf. Parallel
Proc., IEEE no. 77CH1253-4C, Aug. 1977b, p. 217.
Kogge, P. M., The Architecture of Pipelined Computers, McGraw-Hill, New York, 1981.
Kogge, P. M., and Stone, H. S., "A Parallel Algorithm for the Efficient Solution of a General Class of
Recurrence Equations," IEEE Trans. on Comp., C-22, 1973, pp. 786-793.
Kosinski, P. R., "A Data Flow Programming Language," Report RC4264, IBM, T. J. Watson Research
Center, Yorktown Heights, N.Y., Mar. 1973.
Kozdrowicki, E. W., and Theis, D. J., "Second Generation of Vector Supercomputers," IEEE Com-
puter, Nov. 1980, pp. 71-83.
Kraley, M. F., "The Pluribus Multiprocessor," Digest of Papers, 1975 Intl. Symp. on Fault-Tolerant
Computing, Paris, France, June 1975, p. 251.
Kuck, D. J., "Parallel Processing of Ordinary Programs," in Advances in Computers, vol. 15, Rubinoff
and Yovits, eds., Academic Press, New York, 1976, pp. 119-179.
Kuck, D. J., "Illiac IV Software and Application Programming," IEEE Trans. on Comp., Aug. 1968,
pp. 746-757.
Kuck, D. J., "A Survey of Parallel Machine Organization and Programming," ACM Computing
Surveys, vol. 9, no. 1, Mar. 1977, pp. 29-59.
Kuck, D. J., The Structure of Computers and Computations, vol. 1, Wiley, New York, 1978.
Kuck, D. J., Kuhn, R. H., Padua, D. A., Leasure, B., and Wolfe, M., "Dependence Graphs and Com-
piler Optimizations," Proc. 8th ACM Symp. Principles of Programming Languages, Jan. 1981,
pp. 207-218.
Kuck, D. J., Lawrie, D. H., and Sameh, A. H., eds., High Speed Computer and Algorithm Organization,
Academic Press, New York, 1977.
Kuck, D. J., and Stokes, R. A., "The Burroughs Scientific Processor (BSP)," IEEE Trans. on Comp.,
C-31, no. 5, May 1982, pp. 363-376.
Kuhn, R. H., "Optimization and Interconnection Complexity for Parallel Processors, Single-Stage
Networks, and Decision Trees," Ph.D. Thesis, Univ. of Ill. at Urb.-Champ., Dept. of Computer
Science, no. 80-1009, Feb. 1980.
Kuhn, R. H., and Padua, D. A., eds., Tutorial on Parallel Processing, IEEE Computer Society Press,
order no. 367, Los Angeles, 1981.
Kulisch, U. W., and Miranker, W. L., eds., A New Approach to Scientific Computation, Academic Press,
New York, 1983.
Kung, H. T., "Synchronized and Asynchronous Parallel Algorithms for Multiprocessors," Algorithms
and Complexity: Recent Results and New Directions, Traub, ed., Addison-Wesley, 1976.
Kung, H. T., "The Structure of Parallel Algorithms," Advances in Computers, vol. 19, Yovits, ed.,
Academic Press, New York, 1980, pp. 65-112.
Kung, H. T., "Why Systolic Architectures?," IEEE Comp., Jan. 1982, pp. 37-46.
Kung, H. T., and Leiserson, C. E., "Systolic Arrays (for VLSI)," Sparse Matrix Proc., Duff, et al., eds.,
Society of Indust. and Appl. Math., Philadelphia, Pa., 1978, pp. 245-282.
Kung, S. Y., Arun, K. S., Gal-Ezer, R. J., and Rao, D. V. B., "Wavefront Array Processor: Language,
Architecture, and Applications," IEEE Trans. on Comp., C-31, no. 11, Nov. 1982, pp. 1054-1066.
Kurinckx, A., and Pujolle, G., "Analytic Methods for Multiprocessor Modeling," 4th Intl. Symp. on
Modeling and Performance Evaluation of Computer Systems, Part I, Vienna, Austria, Feb. 1979.
Kurtzberg, J. M., "On the Memory Conflict Problem in Multiprocessor Systems," IEEE Trans. on
Comp., C-23, no. 3, Mar. 1974, pp. 286-293.
Lamport, L., "A New Solution of Dijkstra's Concurrent Programming Problem," Comm. of ACM,
vol. 17, Aug. 1974, pp. 453-454.
Lamport, L., "The Synchronization of Independent Processes," Acta Informatica, vol. 7, no. 1, 1976,
pp. 15-34.
Lamport, L., "Proving the Correctness of Multiprocess Programs," IEEE Trans. on Soft. Eng., SE-3,
no. 2, Mar. 1977, pp. 125-143.
Lampson, B. W., "Dynamic Protection Structures," 1969 Fall Joint Computer Conference, AFIPS
Press, 1969, pp. 27-38.
Lampson, B. W., "Protection," Operating Systems Review, 8(1), Jan. 1974.
Lampson, B. W., and Sturgis, H. E., "Reflections on an Operating System Design," Comm. of ACM,
vol. 19, May 1976, pp. 251-265.
Lane, W. G., "Input/Output Processing," Introduction to Computer Architecture, Stone, ed., SRA Inc.,
1978, pp. 275-316.
Lang, D. E., Agerwala, T. K., and Chandy, K. M., "A Modeling Approach and Design Tool for
Pipelined Central Processors," Proc. 6th Ann. Symp. on Computer Architecture, Apr. 1979, pp.
122-129.
Lang, T., and Stone, H. S., "A Shuffle-Exchange Network with Simplified Control," IEEE Trans. on
Comp., C-25, Jan. 1976, pp. 55-65.
Larson, A. G., "Cost-Effective Processor Design with an Application to Fast Fourier Transform
Computers," Ph.D. Thesis, Stanford Univ., 1973.
Larson, A. G., and Davidson, E. S., "Cost-Effective Design of Special-Purpose Processors: A Fast
Fourier Transform Case Study," Proc. 11th Allerton Conf., 1973, pp. 547-587.
Lawrence Livermore Laboratory, "The S-1 Project: Annual Reports," vol. 1 Architecture, vol. 2
Hardware, and vol. 3 Software, UCID-18619, Univ. of Calif., Livermore, 1979.
Lawrie, D. H., "Access and Alignment of Data in an Array Processor," IEEE Trans. on Comp., C-24,
no. 12, Dec. 1975, pp. 1145-1155.
Lawrie, D. H., Layman, T., Baer, D., and Randal, J. M., "Glypnir - A Programming Language for
Illiac IV," Comm. of ACM, vol. 18, Mar. 1975, pp. 157-164.
Lawrie, D. H., and Vora, C. R., "The Prime Memory System for Array Access," IEEE Trans. on
Comp., C-31, no. 5, May 1982, pp. 435-442.
Lee, R. B.-L., "Empirical Results on the Speed, Efficiency, Redundancy and Quality of Parallel Com-
putations," Intl. Conf. Parallel Proc., Aug. 1980, pp. 91-96.
Levy, H. M., Capability-Based Computer Systems, Digital Press, 1983.
Levy, H. M., and Eckhouse, R. H., Jr., Computer Programming and Architecture: The VAX-11,
Digital Press, 1980.
Li, H. F., "Scheduling Trees in Parallel Pipelined Processing Environments," IEEE Trans. on Comp.,
Nov. 1977, pp. 1101-1112.
Lincoln, N. R., "Technology and Design Trade Offs in the Creation of a Modern Supercomputer,"
IEEE Trans. on Comp., C-31, no. 5, May 1982, pp. 363-376.
Lint, B., and Agerwala, T., "Communication Issues in the Design and Analysis of Parallel Algorithms,"
IEEE Trans. on Soft. Eng., SE-7, no. 2, Mar. 1981, pp. 174-188.
Lipovski, G. J., and Malek, M., "A Theory for Multicomputer Interconnection Networks," Tech.
Report TRAC-46, Univ. of Texas, Austin, Mar. 1981.
Lipovski, G. J., and Tripathi, A., "A Reconfigurable Varistructured Array Processor," Proc. 1977
Intl. Conf. on Parallel Proc., 1977, pp. 165-174.
Liptay, J. S., "Structural Aspects of System 360/85: The Cache," IBM Systems Journ., 7, 1969, pp. 15-21.
Liu, J. W. S., and Liu, C. L., "Bounds on Scheduling Algorithms for Heterogeneous Computing
Systems," Proc. IFIP Congress 74, 1974, pp. 349-353.
Liu, J. W. S., and Liu, C. L., "Performance Analysis of Multiprocessor Systems Containing Func-
tionally Dedicated Processors," Acta Informatica, vol. 10, no. 1, 1978, pp. 95-104.
Loomis, H. H., "The Maximum Rate Accumulator," IEEE Trans. on Comp., vol. EC-15, no. 4, Aug.
1966, pp. 628-639.
Lorin, H., Parallelism in Hardware and Software, Prentice-Hall, Englewood Cliffs, N.J., 1972.
Madnick, S. E., and Donovan, J. J., Operating Systems, McGraw-Hill, New York, 1974.
Majithia, J. C., "Cellular Array for Extraction of Squares and Square Roots of Binary Numbers,"
IEEE Trans. on Comp., C-20, no. 12, Dec. 1970, pp. 1617-1618.
Marathe, M., and Fuller, S. H., "A Study of Multiprocessor Contention for Shared Data in C.mmp,"
ACM/SIGMETRICS Conf., Washington, D.C., Dec. 1977.
Matick, R. E., Computer Storage Systems and Technology, Wiley, New York, 1977.
Matick, R. E., "Memory and Storage," Introduction to Computer Architecture, Stone, ed., SRA Inc.,
1980, pp. 205-274.
Mattson, R. L., Gecsei, J., Slutz, D. R., and Traiger, I. L., "Evaluation Techniques for Storage Hier-
archies," IBM Systems Journ., 9, 1970, pp. 78-117.
Mazare, G., "Multiprocessor Systems," Proc. 1974 CERN School of Computing, Godysund, Norway,
Aug. 1974.
McGraw, J. R., "Data Flow Computing - Software Development," IEEE Trans. on Computers,
Dec. 1980, pp. 1095-1103.
Mead, C., and Conway, L., Introduction to VLSI Systems, Addison-Wesley, Mass., 1980.
Miranker, G. S., "Implementation of Procedures on a Class of Data Flow Processors," Proc. Intl.
Conf. Parallel Proc., IEEE no. 77CH1253-4C, 1977, pp. 77-86.
Miura, K., and Uchida, K., "FACOM Vector Processor VP-100/VP-200," Proc. NATO Advanced
Research Workshop on High-Speed Computing, Jülich, W. Germany, Springer-Verlag, June 20-22,
1983.
Moldovan, D. I., "On the Design of Algorithms for VLSI Systolic Arrays," Proc. IEEE, Jan. 1983,
pp. 113-120.
Moto-oka, T., "Overview to the Fifth Generation Computer System Project," Proc. 10th Ann. Symp.
Computer Architecture, June 1983, pp. 417-422.
Moto-oka, T., and Fuchi, K., "The Architectures in the Fifth Generation Computers," Proc. 1983
IFIP Congress, North-Holland, Amsterdam, 1983, pp. 589-602.
Mrosovsky, I., Wong, J. Y., and Lampe, H. W., "Construction of a Large Field Simulator on a Vector
Computer," Journal of Petroleum Tech., Dec. 1980, pp. 2253-2264.
Mueller, P. T., Siegel, L. J., and Siegel, H. J., "Parallel Algorithms for the Two-Dimensional FFT,"
Proc. 5th Intl. Conf. on Pattern Recog. and Image Proc., Dec. 1980, pp. 497-502.
Muntz, R. R., and Coffman, E. G., "Optimal Preemptive Scheduling on Two Processor Systems,"
IEEE Trans. on Comp., C-18, Nov. 1969, pp. 1014-1020.
Muraoka, Y., "Parallelism Exposure and Exploitation in Programs," Ph.D. Thesis, Univ. of Ill. at
Urb.-Champ., Dept. of Computer Science 71-424, Feb. 1971.
Myers, G. J., Advances in Computer Architecture, Wiley, New York, 1978.
Nassimi, D., and Sahni, S. H., "Data Broadcasting in SIMD Computers," Proc. Intl. Conf. Parallel
Proc., Aug. 1980, pp. 325-326.
Nessett, D. M., "The Effectiveness of Cache Memories in a Multiprocessor Environment," Australian
Computer Journ., vol. 7, no. 1, Mar. 1975, pp. 33-38.
Newton, G., "Deadlock Prevention, Detection and Resolution: An Annotated Bibliography,"
ACM Operating Sys. Review, vol. 13, no. 2, Apr. 1979, pp. 33-44.
Newton, R. S., "An Exercise in Multiprocessor Operating System Design," AGARD Conf. Proc. No. 149
on Real-Time Computer-Based Systems, NATO Advisory Group on Aerospace R & D, Athens,
Greece, May 1974.
Ni, L. M., "Performance Optimization of Parallel Processing Computer Systems," Ph.D. Thesis,
School of Electrical Engineering, Purdue Univ., Ind., Dec. 1980.
Ni, L. M., and Hwang, K., "Performance Modeling of Shared Resource Array Processors," IEEE
Trans. on Soft. Eng., SE-7, no. 4, July 1981, pp. 386-394.
Ni, L. M., and Hwang, K., "Vector Reduction Techniques for Arithmetic Pipelines," IEEE Trans.
Computers, May 1985.
Nolen, J. S., Kuba, D. W., and Kascic, M. J., Jr., "Application of Vector Processors to the Solution
of Finite Difference Equations," AIME 5th Symp. Reservoir Simulation, Feb. 1979, pp.
37-44.
Nutt, G. J., "A Parallel Processor Operating System," IEEE Trans. on Soft. Eng., SE-3, no. 6, Nov.
1977a, pp. 467-475.
Nutt, G. J., "Memory and Bus Conflict in an Array Processor," IEEE Trans. on Comp., June 1977b,
pp. 514-521.
Oleinick, P. N., The Implementation and Evaluation of Parallel Algorithms on C.mmp, Ph.D. Disserta-
tion, Carnegie-Mellon Univ., 1978.
Oleinick, P. N., and Fuller, S. H., "The Implementation and Evaluation of a Parallel Algorithm on
C.mmp," Technical Report, Carnegie-Mellon Univ. Computer Science Dept., Dec. 1977.
Orcutt, S. E., "Computer Organization and Algorithms for Very High-Speed Computations," Ph.D.
Thesis, Stanford University, Calif., 1974.
Ousterhout, J. K., "Partitioning and Cooperation in a Distributed Multiprocessor Operating System:
Medusa," Ph.D. Thesis, Carnegie-Mellon Univ., April 1980.
Owen, G. J., "Rollback - A Method of Process and System Recovery," Proc. Conf. on Soft. Eng. for
Telecommunication Switching Systems, Colchester, England, Apr. 1973.
Owicki, S., and Gries, D., "Verifying Properties of Parallel Programs: An Axiomatic Approach,"
Comm. of ACM, vol. 19, no. 5, May 1976, pp. 279-285.
Padua, D. A., Kuck, D. J., and Lawrie, D. H., "High-Speed Multiprocessors and Compilation
Techniques," IEEE Trans. on Comp., C-29, Sept. 1980, pp. 763-776.
Parasuraman, B., "Pipelined Architectures for Microprocessors," COMPCON Proc., 1976, pp. 225-228.
Parnas, D. L., "On the Criteria to Be Used in Decomposing Systems into Modules," Comm. of ACM,
vol. 15, Dec. 1972.
Patel, J. H., "Improving the Throughput of Pipelines with Delays and Buffers," Ph.D. Thesis, Uni-
versity of Illinois at Urb.-Champ., 1976.
Patel, J. H., "Pipelines with Internal Buffers," Proc. 5th Ann. Symp. on Computer Architecture, Apr.
1978, pp. 249-254.
Patel, J. H., "Performance of Processor-Memory Interconnections for Multiprocessors," IEEE Trans.
on Comp., Oct. 1981, pp. 771-780.
Paul, G., "Large-Scale Vector/Array Processors," IBM Research Report, RC 7306, Sept. 1978 (34
pages).
Paul, G., and Wilson, M. W., The VECTRAN Language: An Experimental Language for Vector/Matrix
Array Processing, IBM Palo Alto Scientific Center Report G320-3334, Aug. 1975.
Pearce, R. C., and Majithia, J. C., "Analysis of a Shared Resource MIMD Computer Organization,"
IEEE Trans. on Comp., C-27, no. 1, Jan. 1978, pp. 64-67.
Pease, M. C., "An Adaptation of the Fast Fourier Transform for Parallel Processing," Journ. of ACM,
vol. 15, Apr. 1968, pp. 252-264.
Pease, M. C., "The Indirect Binary n-cube Microprocessor Array," IEEE Trans. on Comp., C-25,
May 1977, pp. 458-473.
Perrott, R. H., "A Language for Array and Vector Processors," ACM Trans. on Programming Lan-
guages and Systems, vol. 1, no. 2, Oct. 1979, pp. 177-195.
Peterson, J. L., "Petri Nets," ACM Computing Surveys, vol. 9, no. 3, Sept. 1977, pp. 223-252.
Pradhan, D. K., and Kodandapani, K. L., "A Uniform Representation of Single- and Multistage
Interconnection Networks Used in SIMD Machines," IEEE Trans. on Comp., Sept. 1980,
pp. 777-790.
Prasad, N. S., Architecture and Implementation of Large Scale IBM Computer Systems, Q.E.D. Infor-
mation Sciences, Inc., Wellesley, Mass., 1981.
Preparata, F. P., "Parallelism in Sorting," Proc. 1977 Intl. Conf. on Parallel Proc., Detroit, Mich., Aug.
1977, pp. 202-206.
Preparata, F. P., and Vuillemin, J. E., "The Cube-Connected Cycles: A Versatile Network for Parallel
Computation," Proc. 20th Symp. Foundations of Computer Science, 1979, pp. 140-147.
Preston, K., Duff, M. J. B., Levialdi, S., Norgren, P. E., and Toriwaki, J. I., "Basics of Cellular
Logic with Some Applications in Medical Image Processing," Proc. IEEE, May 1979, pp. 826-856.
Prieve, B. G., and Fabry, R. S., "VMIN - An Optimal Variable-Space Page Replacement Algorithm,"
Comm. of ACM, vol. 19, May 1976, pp. 295-297.
Purcell, C. J., "The Control Data STAR-100 - Performance Measurements," AFIPS NCC Proc.,
1974, pp. 385-387.
Radoy, C. H., and Lipovski, G. J., "Switched Multiple Instruction Multiple Data Stream Processing,"
Proc. 2d Ann. Symp. Computer Architecture, 1974, pp. 183-187.
Ramamoorthy, C. V., Chandy, K. M., and Gonzalez, M. J., "Optimal Scheduling Strategies in a
Multiprocessor System," IEEE Trans. on Comp., C-21, 2, Feb. 1972, pp. 137-146.
Ramamoorthy, C. V., and Gonzalez, M. J., "Recognition and Representation of Parallel Processable
Streams in Computer Programs - II (Task/Process Parallelism)," Proc. ACM 24th Nat. Conf.,
ACM, New York, 1969, pp. 387-397.
Ramamoorthy, C. V., and Kim, K. H., "Pipelining - The Generalized Concept and Sequencing
Strategies," NCC Proc., AFIPS Press, 1974, pp. 289-297.
Ramamoorthy, C. V., and Li, H. F., "Sequencing Control in Multifunctional Pipeline Systems,"
Proc. 1975 Sagamore Comp. Conf. on Parallel Proc., 1975, pp. 79-89.
Ramamoorthy, C. V., and Li, H. F., "Pipeline Architecture," ACM Computing Surveys, vol. 9, no. 1,
Mar. 1977, pp. 61-102.
Rao, G. S., "Performance Analysis of Cache Memories," Journ. of Assoc. of Comp. Mach., vol. 25,
no. 3, 1978, pp. 378-395.
Rau, B. R., and Rossman, G. E., "The Effect of Instruction Fetch Strategies upon the Performance of
Pipelined Instruction Units," Proc. 4th Ann. Symp. Computer Architecture, IEEE 77CH1182-5C,
1977, pp. 80-89.
Rau, B. R., "Program Behavior and the Performance of Interleaved Memories," IEEE Trans. on
Comp., C-28, no. 3, Mar. 1979, pp. 191-199.
Reilly, J., Sutton, A., Nasser, R., and Griscom, R., "Processor Controller for the IBM 3081," IBM
Journ. of Res. and Dev., vol. 26, no. 1, Jan. 1982, pp. 22-29.
Rice, J. R., Matrix Computations and Mathematical Software, McGraw-Hill, New York, 1981.
Ritchie, D. M., and Thompson, K., "The UNIX Time-Sharing System," Comm. of ACM, vol. 17, July
1974, pp. 365-375.
Robinson, J. T., "Some Analysis Techniques for Asynchronous Multiprocessor Algorithms," IEEE
Trans. on Soft. Eng., Jan. 1979, pp. 24-30.
Rodrigue, G., ed., Parallel Computations, Academic Press, New York, 1982.
Rodrigue, G., Giroux, E. D., and Pratt, M., "Perspective on Large-Scale Scientific Computations,"
IEEE Comp., Oct. 1980, pp. 65-80.
Roesser, R. P., "Two-Dimensional Microprocessor Pipelines for Image Processing," IEEE Trans. on
Comp., Feb. 1978, pp. 144-156.
Rohrbacher, D., and Potter, J. L., "Image Processing with STARAN Parallel Computer," IEEE
Comp., Aug. 1977, pp. 54-59.
Rosene, A. F., "Memory Allocation for Multiprocessors," IEEE Trans. on Electronic Computers, vol.
16, no. 5, Oct. 1967, pp. 659-665.
Rumbaugh, J., "A Data Flow Multiprocessor," IEEE Trans. on Comp., C-26, no. 2, Feb. 1977,
pp. 138-146.
Russell, E. C., "Automatic Program Analysis," Ph.D. Thesis, Dept. of Electrical Engineering, Univ. of
Calif., Los Angeles, 1969.
Russell, R. M., "The Cray-1 Computer System," Comm. of ACM, Jan. 1978, pp. 63-72.
Saltzer, J. H., "A Simple Linear Model of Demand Paging Performance," Comm. of ACM, vol. 17,
April 1974.
Saltzer, J. H., and Schroeder, M. D., "The Protection of Information in Computer Systems," Proc.
IEEE, Sept. 1975, pp. 1238-1308.
Sameh, A. H., "Numerical Parallel Algorithms - A Survey," High-Speed Computers and Algorithm
Organization, Kuck, et al., eds., Academic Press, 1977, pp. 207-228.
Sastry, K. V., and Kain, R. Y., "On the Performance of Certain Multiprocessor Computer Organiza-
tions," IEEE Trans. on Comp., vol. C-24, no. 11, Nov. 1975, pp. 1066-1074.
Satyanarayanan, M., Multiprocessors: A Comparative Study, Prentice-Hall, Englewood Cliffs, N.J.,
1980.
Schaefer, D. H., "Spatially Parallel Architectures: An Overview," Computer Design, Aug. 1982,
pp. 117-124.
Schmid, H. A., "On the Efficient Implementation of Conditional Critical Regions and the Construction
of Monitors," Acta Informatica, 6, Springer-Verlag, 1976, pp. 227-249.
Schwartz, J. T., "Ultracomputers," ACM Trans. Programming Languages and Systems, vol. 2, no. 4,
1980, pp. 484-521.
Senzig, D. N., and Smith, R. V., "Computer Organization for Array Processing," AFIPS FJCC Proc.
(part 1), 1965, pp. 117-128.
Sethi, A. S., and Deo, N., "Interference in Multiprocessor Systems with Localized Memory Access
Probabilities," IEEE Trans. on Comp., vol. C-28, no. 2, Feb. 1979.
Shapiro, H. D., "A Comparison of Various Methods for Detecting and Utilizing Parallelism in a Single
Instruction Stream," Proc. 1977 Intl. Conf. Parallel Proc., IEEE No. 77CH1253-4C, 1977, pp.
67-76.
Shar, L. E., "Design and Scheduling of Statically Configured Pipelines," Digital Systems Lab Report
SU-SEL-72-042, Stanford University, Stanford, Calif., Sept. 1972.
Shar, L. E., and Davidson, E. S., "A Multiminiprocessor System Implemented Through Pipelining,"
IEEE Comp., Feb. 1975, pp. 42-51.
Shen, J. P., and Hayes, J. P., "Fault Tolerance of a Class of Connecting Networks," Proc. 7th Symp.
Computer Architecture, 1980, pp. 61-71.
Shoshani, A., and Coffman, E. G., Jr., "Sequencing Tasks in Multiprocess, Multiple Resource Systems
to Avoid Deadlocks," Proc. 11th Ann. Symp. Switching and Automata Theory, Oct. 1970, pp.
225-233.
Siegel, H. J., "A Model of SIMD Machines and a Comparison of Various Interconnection Networks,"
IEEE Trans. on Comp., vol. C-28, no. 12, Dec. 1979a, pp. 907-917.
Siegel, H. J., "Interconnection Networks for SIMD Machines," IEEE Comp., June 1979b, pp. 57-65.
Siegel, H. J., "The Theory Underlying the Partitioning of Permutation Networks," IEEE Trans. on
Comp., vol. C-29, no. 9, Sept. 1980, pp. 791-800.
Siegel, H. J., Interconnection Networks for Large-Scale Parallel Processing: Theory and Case Studies,
Lexington Books, Lexington, Mass., 1984.
Siewiorek, D. P., et al., "A Case Study of C.mmp, Cm* and C.vmp, Part I: Experience with Fault-
Tolerance in Multiprocessor Systems," Proc. IEEE, vol. 66, no. 10, Oct. 1978, pp. 1178-1199.
Siewiorek, D., Bell, C. G., and Newell, A., Principles of Computer Structures, McGraw-Hill, New York,
1980.
Sites, R. L., "Operating Systems and Computer Architecture," in Introduction to Computer Architecture
(Stone, ed.), SRA Inc., 1980, pp. 591-643.
Slotnick, D. L., "Unconventional Systems," Computer Design, Dec. 1982, pp. 49-52.
Slotnick, D. L., Borck, W. C., and McReynolds, R. C., "The SOLOMON Computer," Proc. of
AFIPS Fall Joint Comp. Conf., Wash. D.C., 1962, pp. 97-107.
Smith, A. J., "A Modified Working-Set Paging Algorithm," IEEE Trans. on Comp., C-25, Sept. 1976,
pp. 907-914.
Smith, A. J., "Multiprocessor Memory Organization and Memory Interference," Comm. of ACM,
vol. 20, no. 10, Oct. 1977, pp. 754-761.
Smith, A. J., "A Comparative Study of Set-Associative Memory Mapping Algorithms and Their Use for
Cache and Main Memory," IEEE Trans. Soft. Eng., vol. SE-4, March 1978, pp. 121-130.
Smith, A. J., "Cache Memories," ACM Computing Surveys, vol. 14, no. 3, Sept. 1982, pp. 473-536.
Smith, B. J., "A Pipelined Shared Resource MIMD Computer," Proc. 1978 Intl. Conf. on Parallel
Proc., 1978, pp. 6-8.
Smith, B. J., "Architecture and Applications of the HEP Multiprocessor Computer System," Real
Time Signal Processing IV, vol. 298, Aug. 1981.
Smith, J. W., "Cooperation and Competition: An Approach to Parallel Computation," Proceedings
of Southeastcon 1979, Roanoke, Va., Apr. 1979.
Snyder, L., "Introduction to the Configurable Highly Parallel Computer," IEEE Comp., Jan. 1982,
pp. 47-64.
Wu, C. L., and Feng, T. Y., "Universality of the Shuffle-Exchange Network," IEEE Trans. on Com-
puters, vol. C-30, no. 5, May 1981, pp. 324-331.
Wulf, W. A., and Bell, C. G., "C.mmp - A Multi-miniprocessor," Proc. AFIPS Fall Joint Computer
Conference, vol. 41, AFIPS Press, Montvale, N.J., 1972, pp. 765-777.
Wulf, W. A., et al., "Overview of the Hydra Operating System," in Proc. of the 5th Symp. on Operating
System Principles, ACM, Nov. 1975, Austin, pp. 122-131.
Wulf, W. A., Levin, R., and Harbison, S. P., HYDRA/C.mmp: An Experimental Computer System,
McGraw-Hill, N.Y., 1981.
Yau, S. S., and Fung, H. S., "Associative Processor Architecture - A Survey," ACM Computing
Surveys, March 1977, vol. 9, no. 1, pp. 3-28.
Yeh, P. C., "Shared Cache Organization for Multiple-Stream Computer Systems," Coordinated Science
Lab., Univ. of Ill., Tech. Rep. R-904, Jan. 1981.
Yeh, P. C., Patel, J. H., and Davidson, E. S., "Shared Cache for Multiple Stream Computer Systems,"
IEEE Trans. on Comp., vol. C-32, no. 1, Jan. 1983, pp. 38-47.
INDEX
Access, privileged, 584 Amdahl 470/V7, 111
Access conflict (see Conflict, memory) Amdahl 470/V8, 111, 113, 115, 580
Access matrix, 587 Amdahl 580, 113
Access network, 53 Anderson, D, W., 229
Access privileges, 69 Andrews, G. R., 551
Access time, 13, 53, 56-57 Anticipatory fetch, 113
Accumulator register (ACAR), 401, 409 AP-120B, 154, 196, 235, 249-258, 319
Ackerman, W. B., 807 AP-190L, 258
Acknowledge signal, 484-489 APEX mode, 255, 494, 495
Activity store, 31 Arbitration network, 749, 753
Actus, 301, 441-444, 454 Arithmetic and logic unit (ALU), 9, 170, 326
Adder, 12, 170 Arithmetic control unit, 381
Address: Arithmetic controller, 261
local, 330, 463 Arithmetic element (AE), 411-417
physical, 479 Arithmetic pipeline, 164-181, 240, 243,
vector, 215 798-801
Address mapping (see Mapping, address) Arithmetic unit (AU), 243
Address offset, 415 ARPA network, 400, 402
Address pipe, 270 Array:
Address register, 328 expression statement, 417
Address-save register, 268 sorting of, 361-367
Address space, 61 Array control unit (ACU), 326-327, 424, 426,
Address trace, 82 428
Address translation process, 70, 72-73 Array pipeline, 181-187
Addressing: Array processor (see SIMD computers)
cache, 111-112 Artificial intelligence, 45, 375, 530
interleaved, 58-60, 101 Arvind's dataflow machine, 745, 757-759,
virtual (see Virtual addressing) 807-808
Addressing fault, 62 Ascending sort, 386
Advance instruction station (ADVAST), 402 ASCII (American Standards Committee on
Aerodynamics simulation, 44-45 Information Interchange), 119
Age registers, 116 Assignment decision, 590
Agerwala, T., 551 Associative array processing, 374-388
Ahuja, S. R., 551 Associative mapping, 64
Algol, 3, 355, 440, 544 Associative memory (AM), 25, 57-58, 64,
Algorithm: 374-380, 385
decomposition of, 453 Associative processor, 25, 325, 375, 380-385,
MIMD, 613-616 393-396,
partitioned, 790-803 Astrophysics, 43
SIMD, 325-388, 453, 545, 614, 616-628 Asymmetry in multiprocessors, 471
synchronized, 614, 616-622 Asynchronous algorithm, 614, 622-428
Algorithm restructuring, 545 Asynchronous communication, 332
Alignment network, 327, 412, 413 Asynchronous parallelism, 20
Always prefetch, 113 Asynchronous variables, 680
Amamiya, M., 760, 808 Attached processor, 235-237, 249-264,
Amber Operating System, 668 327-328
Amdahl 470/V6, 77, 111, 234 Attached Support Processor System (IBM), 460
Cray-1, 4, 14, 22, 154, 154, 189, 194, 216, 228, Decimation-in-frequency (DIF) technique,
235, 264-280, 319 367-368
Cray-2, 4, 27, 235, 728 DECNET, 430
Cray Operating System (COS), 718 Decomposition, 615-616, 619
Cray X-MP, 4, 27, 235, 262, 280, 477, 480, Degree of multiprogramming (DOM), 87
714-728 Delta network, 494-508, 554, 754
architecture, 715-716 Demand fetch, 82, 113
performance analysis, 721-728 Demultiplexer, 494
Critical section, 539, 540, 548, 551 Denelcor, Inc., 670-676, 729
Cross-interrogate, 521 Denning, P., 80, 551
Crossbar network, 25, 336, 338, 342, 502 Dennis, J. B., 50, 732, 748-749, 807
Crossbar switch, 143, 415, 469, 471, 473, Dependence-driven, 764-765
487-492, 495, 504, 513, 517, 548 Despain, A., 468
CA access memory organization, 162 Deterministic scheduling, 592
Cumulative distribution function (CDF), 592, Device driver, 118
606-616 Diagnostic mode, 409
Cyber-170 (CDC), 459, 473, 478, 526 Dias, D. M., 551
Cyber-205 (CDC), 3, 4, 15, 22, 153, 154, 216, Dijkstra, E. W., 552
235, 280-293, 454 Direct mapping, 64, 102-103
Cycle: Direct mapping, 64, 102-103
pipeline, 20 Direct-memory access (DMA}, 12, 122-123,
simple and greedy, 206 128, 134-136, 239, 249, 385
Cycle ratio, 520 Disk, 53-55, 119
Cycle stealing, 12 electronic (CCD, MBM), 55
Cycle time memory, 13-14, 160 Dispatcher subsystem, 91, 345-350, 354, 374
Distributed Array Processor (DAP), 37,
394-395
Daisy chaining, 482, 484 Distributed control, 333
Data buffer unit, 193-196 Distributed logic, 381
Data dependency, 7, 21, 29, 164, 233, 542 Distributed processing, 7, 27
Data-dependent hazard, 201 Distribution network, 749, 753
Data-driven organization, 734 Divide loop, 189
Data flow computers, 20, 29-31, 735-737, Domain of instruction, 203
748-763 Dorr, F. W., 221
advantages, 745-746 Dot product, 213
Data flow graphs, 29, 740-742 DR-780, 430
Data flow languages, 740, 761 Drain time, 513
Data link, 430 Dubois, M., 229, 551
Data-manipulator network, 336, 347-351 Dynabus, 705-707
Data processing, 46 Dynamic address translation (DAT), 68
Data rate, average, 161 Dynamic coherence check, 521
Data routing, 327-328, 330-332, 344 Dynamic data flow computers, 737, 755-763
Data-routing register, 328 Dynamic decomposition, 615-616
Data stream, 32 Dynamic network, 333-339
Data token, 735 Dynamic priority algorithm, 485
Data transfer rate, 14 Dynamic relocation, 64
Data transmission (communication), 119
Data transmit pipeline, 273
Database machine, 16, 385, 396, 425 Economic modeling, 43, 410
Database-management system, 385 EDDY dataflow machine, 760-761
Davidson, E. S., 154, 208, 229, 321, 851 EDVAC (Electronic Discrete Variable
Davis, A. L., 868 Automatic Computer), 3
Deadlock, 526, 577-583 Efficiency, 151, 315, 445, 446, 550
prevention, 580-582 El-Ayat, K. A., 134, 136, 141
recovery, 582-583 Element memory control, 381
Greedy cycle, 206 IBM 3081 system, 4, 27, 234, 259, 471, 477,
Greedy strategy, 203 690-692
Grohoski, G. R., 229 IBM 3084 system, 471
Guardian-Expand Network system, 713 IBM 3838 computer, 154, 235, 249, 259-262,
Guardian operating system, 712-713 328
Gurd, J., 762-763, 808 IBM 4341 computer, 259
IBM 7094 computer, 233
IBM AP configuration, 686
Habermann, A. N., 551 IBM MP configuration, 686
Handler, W., 37, 50, 181, 229 IBM operating system, 693-694
Handler's classification, 137-140 ID (Irvine Dataflow) Language, 740
Handshaking, 125 Identifier, unique, 61
Hansen, P. B., 552 Illiac-IV computer, 3, 25, 37, 234, 328, 394,
Harbison, S. P., 728 397-410, 426, 440, 441, 444, 447, 448, 452,
Hayes, J. P., 49, 229 454
Hazard, detection and resolution, 200-203 Image processing, 375, 430, 433, 454
Hedlund, K. S., 787, 808 Implicit parallelism, 530
Hellerman, H., 163, 229 Independent reference model (IRM), 83
HEP (Denelcor), 4, 27, 262, 477, 669-684 Index increment, 302
architecture, 670-672 Indexing:
Hierarchical memory, 12-13 global vs. local, 332
Higbie, L. C., 320 row-major, 362-363
Hintz, R. G., 229, 321 Indicator register, 375
Histogramming, 546, 548 Individual stage control, 349
Hit, cache, 98-102, 112-113 Information processing, 4-6
Hit ratio, 56, 98, 525 Initial index, 302
Hoare, C. A. R., 552 Initiation interval set, 209
Hockney, R. W., 49, 320, 388 Input/output (I/O):
Holes, memory, 75 asymmetricity, 471
Home memory, 509, 510 overlapping, 12, 18, 233
Honeywell 60/66, 459, 471, 473, 474, 522 private, 483
Horizontal processing, 226-227 Input/output-bound (I/O-bound), 18, 769
Host computer, 327-328, 396, 410, 425 Input/output (I/O) channel, 10, 385, 471, 490
Hsiao, D., 385 Input/output (I/O) control, 429, 709
Hu, T. C., 602, 604 Input/output (I/O) controller, 709-712
Hwang, K., 50, 180, 218, 220, 230, 321, 388, Input/output (I/O) devices, 14
454, 551, 637, 729, 766, 789, 791-794, 805, Input/output (I/O) interface, 425, 428, 460
807-808 Input/output (I/O) interrupt, 120-129
Hydra Operating System, 650-654 Input/output (I/O) processor (independent),
410, 428
Input/output (I/O) register, 424
IBM 303X computer, 285 Input/output (I/O) subsystem, 8-9, 118-141,
IBM 360/85, 106 259, 716
IBM 360/91, 3, 11, 154, 164-181, 191-198, 234, Input selector (8), 334
236, 249, 259 Instruction:
IBM 370/158, 103 control-type, 327
IBM 370/168, 4, 9-10, 14, 77-78, 118, 427, decoding, 150
478, 480, 492, 528, 685-687 execution, 31
IBM 370/195, 234, 236, 249, 259 prefetch, 187-193
IBM 701 electronic calculator, 3 prefetch, 187-193
IBM 1620 computer, 3 range and domain of, 201
IBM 2938 computer, 259 scalar, 327
IBM 3033 system, 14, 111, 115, 118, 687-690 Instruction buffer, 194
IBM 3035 computer, 235, 249, 259 Instruction cycle, 20-21
MOPS (million operations per second), Next instruction parcel (NIP), 269
235-236 Ni, L. M., 50, 454, 581
Most processed first (MPF), 211 Nonblocking crossbar, 487
Most work remaining first (MWRF), 211 Noncacheable data, 520
Motooka, T., 764 Noncompute delay, 208
MSI (medium-scale integrated) circuits, 3 Nonlookahead algorithms, 92-93
MSIMD (multiple-SIMD), 397, 448, 452-453 Nonmaskable interrupt, 126
MSISD (multiple-SISD), 34 Not-equal-to search, 385
Mueller, P. T., 369, 388 Not-greater-than search, 386
Multi-access memory (see Associative memory) Not-smaller-than search, 386, 387
Multi-Associative Processor (MAP), 448 NP-hard problems, 468
Multichip carrier (MCC), 296 Nuclear reactor safety, 48
Multidimensional access (MDA), 381, 383, 424 Numerical Aerodynamic Simulation Facilities
Multifunctionality, 11-12 (NASF) computer, 45, 293
Multiplexor, 128-132, 474
Multiplication matrix algorithms, 355-359, 435,
796 Object, data and resource, 201
Multiplier recoding, 180 Oceanography, 43
Multiport memory, 25, 487, 490, 492, 517 Octets, 243
Multiprocessing, 6-7 Odd-even merge sort, 363-364
Multiprocessor, 25-27, 31, 459-460, 468, 471, Oleinick, P. H., 657-658, 728
526, 531, 591-605, 614, 643-645 Omega network, 336, 350-354, 373
commercial systems, 644-645 OMEN, 395
evolution of, 4 On-line mode, 424
exploratory systems, 643-644 One block lookahead (OBL), 97-98
interconnections for, 25 One-sided network, 336
operating systems for, 525 Operating system, 409, 453, 525-527, 529, 531,
scheduling of, 590-602 550-551, 693-694
software requirements for, 528 classification of, 526
Multiprocessor anomalies, 605 requirements for, 331
Multiprocessor systems, 459 Optimization, 305-314
Multiprogramming, 3, 6-7, 16-19 Order statistics, 611
Multistage network, 334-339, 344, 354, Ordered retrieval, 386
373-374, 492-493 OS/VS2 (IBM), 693-694
MUTEXBEGIN and MUTEXEND, 539 Out queue, 466
Mutual exclusion, 532, 539, 558 Output selector (OS), 334
Overlapping, I/O and CPU, 12, 18, 233
Overlay, 61
Naming (compiler), 61
NASA Ames Research Center, 394, 444
N-cube network, 336, 343-344, 352 PACK instruction, 302
Nearest below search, 386 Packet-switched bus, 466
Network: Packet switching, 333
control structure, 336, 339, 344 Packet switching network, 673-674
interconnection, 327-328, 332-354, 373-374, Padua, D. A., 50
388, 434, 453, 460, 481, 487, 513, 519, Page fault, 91-96, 549
520 rate of, 77-78, 84-86
inter-PE communication, 327 (See also Cache miss)
passive vs. dynamic, 333 Page-fault frequency (PFF) replacement,
topology, 334-354 95-96
(See also specific network name) Page segmentation, 77-80
Network access device (NAD), 290 Page size, 79-80
Newton-Raphson iteration, 413 Page table, 65
Newton’s iteration, 516, 618, 622-624 Paging, 65-71
Supervisor call (SVC), 478, 529 Time division multiplexing (TDM), 485
Supervisor mode, 478 Time interval (space-time diagram), 151
Swain, P., 46 Time-shared common bus, 25
Swapping, 97 Time sharing, 3, 6-7, 16-19
Switch, 332-333, 753, 781-783 Time slice, 18
Switch box, 336, 339, 344, 352 Time-space span, 151
Switch lattice, 781-783 Token, 758, 763
Switching: Tomasulo, R. M., 230
circuit and packet, 333 Topology, register, 426
inter-PE, 332-333 TRADEC, 3
Synchronization, 558-565, 679-690 Tranquil, 440
Synchronization primitives, 480, 540 Transistors, 3
Synchronous communication, 332 Translation lookahead buffer, 111
Synonym problem, 111 Translation lookaside buffer (TLB), 64, 99,
Syntax-parsing phase, 305 111
Syre, J. C., 763, 808 Transportation sort, 363
System manager, 410 Travel time, 256
Systolic arrays, 334, 769-786 Tree machine, 468
reconfigurable, 780-786 Tree task system, 224-225
Treleaven, P. C., 808
Triadic operation, 413
Table conflict, 527 Triangular linear system, 797-798
Tagged prefetch, 113 True ratio, 297
Tagged tokens, 738 Tuning, 313
Tagging, 761, 762 Two-sided network, 336
Takahashi, N., 760, 868
TAL (Tandem Algorithmic Language), 705
Tandem Nonstop System, 27, 705-713 UART (Universal Asynchronous Receiver-
architecture, 707 Transmitter), 123
Task graph, 222-226, 602-609 U-interpreter, 745
Task scheduling (vector processing), 218-229 Unfolding, 744
Tate, D. P., 229, 321 Unger machine, 394
Template (data flow computer), 31 Unibus, 9
Temporary register, 375 Uniform tree, 495
Terminal index, 302 Uniprocessor, 8-19
Tesler, L. G., 807 Univac, 477, 492, 701-702
Test-and-set, 480, 559 Univac 1100 series, 694-695, 701-705
Theis, D. J., 320 Univac 1100/80, 4, 27, 471, 522, 696-700
Thermal conduction module (TCM), 690 Univac 1100/90, 492, 761
Thomas, A. T., 229 Univac multiprocessors, 694-704
Thomas, R. E., 388, 756, 868 Unix, 528
Thompson, C. D., 363, 365, 388 Unmapped local memory (ULM), 469, 471
Thrashing, 87 Unpack instruction, 302
3-cube connected-cycle network, 334 Update, memory, 113-115
3-cube network, 334, 342, 358 Upper broadcast state, 336, 352
Three-stage network, 343 User mode, 478
Threshold search, 386 Utilization, 35, 445, 446, 532
Throughput, 10, 151, 315, 445, 483, 519
Thurber, K. J., 388, 395
TI-ASC computer, 3, 35, 38, 151, 154, 183, V semaphore, 565-568
192, 194, 196, 212, 215, 233, 242-249 Vacuum tubes, 3
Tightly coupled systems (TCS), 35, 460, 468, VAL (Value Algorithmic Language), 740
480, 508, 510, 547 Variable:
Time complexity, 355-357, 362 asynchronous, 680