CS405 - Computer System Architecture
MODULE – 1
Parallel Computer Models
Text Book:
K. Hwang and Naresh Jotwani, Advanced Computer Architecture: Parallelism, Scalability, Programmability, TMH, 2010.
Page references:
Module I: 1-29, 36-40, 44-52, 108-111
Module II: 133-167
Module III: 281-312
Module IV: 318-322, 324-334, 227-240
Module V: 240-273
Module VI: 408-444, 458-465
Computer System Architecture
Classification based on the number of processors:
Single-processor systems
One main CPU executing general-purpose instructions.
Multiprocessor systems (parallel systems)
Two or more processors in close communication, sharing the computer memory, peripherals and bus.
Advantages: increased throughput (performance), economy of scale, increased reliability.
Clustered systems
Two or more systems coupled together.
Advantage: increased availability.
Parallel vs. Distributed Systems
Parallel systems
Multiple processors have direct access to the shared memory.
Tightly coupled systems.
Distributed systems
A collection of independent computers interconnected via a network.
Distributed memory; loosely coupled systems.
Parallel Computing
A modern computer system consists of computer hardware, instruction sets, application programs, system software and a user interface.
Computing Problems
Numerical computing
Transaction processing
Logical reasoning
Algorithms and Data Structures
Hardware Resources
Operating Systems
System Software Support
Compiler Support
Preprocessor
Precompiler
Parallelizing compiler
Five Generations of Electronic Computers
1940 – 1956: First Generation – Vacuum Tubes
1956 – 1963: Second Generation – Transistors
1964 – 1971: Third Generation – Integrated Circuits (Semiconductors)
1972 – 2010: Fourth Generation – Microprocessors
2010 – present: Fifth Generation – Artificial Intelligence (parallel processing and superconductors)
Flynn’s Taxonomy
The most universally accepted method of classifying computer systems.
Scalar processor – processes only one data item at a time.
Vector processor – a single instruction operates simultaneously on multiple data items.
Cont…
Implicit vector – software-controlled instruction processing (e.g., vectorization performed by a compiler).
Explicit vector – hardware-controlled instruction processing using dedicated vector instructions.
Two families of pipelined vector processors:
Memory-to-memory architecture supports the pipelined flow of vector operands directly from the memory to the pipelines and then back to the memory.
Register-to-register architecture uses vector registers to interface between the memory and the functional pipelines.
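As an illustration of the software-controlled (implicit) approach described above, the loop below is ordinary scalar source code that a vectorizing compiler may map onto vector hardware without any source change; the function name, data type and build flag are assumptions used only for illustration.

/* saxpy.c - a loop that a vectorizing compiler may turn into vector
 * instructions without source changes (implicit vectorization).
 * Assumed build command: gcc -O3 -ftree-vectorize saxpy.c */
#include <stddef.h>

void saxpy(size_t n, float a, const float *x, float *y)
{
    /* Each iteration is independent, so the compiler is free to emit
     * one vector instruction that processes several elements at once. */
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}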
There are two major classes of parallel computers: shared-memory multiprocessors and message-passing multicomputers.
The processors in a multiprocessor system communicate with each other through shared variables in a common memory.
Each computer node in a multicomputer system has a local memory, unshared with other nodes.
Interprocessor communication is done through message passing among the nodes.
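As a minimal sketch of communication through shared variables, the POSIX-threads program below lets two threads (standing in for two processors) update one counter that lives in the common memory; the counter, thread count and iteration count are illustrative assumptions.

/* Two threads communicate through a variable in shared memory.
 * A mutex serializes access to the shared data. */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                          /* shared variable */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);                /* protected access */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);           /* both threads saw the same memory */
    return 0;
}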
Six layers for computer system development.
System Attributes to Performance
Performance factors:
Ic – number of machine instructions to be executed in the program
p – number of processor cycles needed for instruction decode and execution
m – number of memory references per instruction
k – ratio between the memory cycle and the processor cycle
t – processor cycle time
System attributes:
The above five performance factors (Ic, p, m, k, t) are influenced by four system attributes: instruction-set architecture, compiler technology, CPU implementation and control, and cache and memory hierarchy.
Clock Rate and CPI
The inverse of the cycle time is the clock rate (f = 1/t).
Cycles per instruction (CPI) – the average number of cycles needed to execute an instruction.
CPU time: T = Ic * (p + m * k) * t
MIPS Rate
Let C be the total number of clock cycles needed to execute a given program. Then the CPU time can be estimated as T = C * t = C / f.
CPI = C / Ic, so T = Ic * CPI * t = Ic * CPI / f.
MIPS rate = Ic / (T * 10^6) = f / (CPI * 10^6).
Throughput Rate
The number of programs a system can execute per unit time is called the system throughput Ws (in programs/second).
In a multiprogrammed system, the system throughput is often lower than the CPU throughput Wp, where
Wp = f / (Ic * CPI) = (MIPS * 10^6) / Ic.
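A small worked example of these formulas is sketched below; all numeric values (Ic, p, m, k, f) are assumed purely for illustration.

/* Performance model: CPI = p + m*k, T = Ic * CPI / f,
 * MIPS = f / (CPI * 10^6), Wp = MIPS * 10^6 / Ic. */
#include <stdio.h>

int main(void)
{
    double Ic = 2.0e8;   /* instructions executed in the program      */
    double p  = 1.2;     /* processor cycles for decode and execution */
    double m  = 0.3;     /* memory references per instruction         */
    double k  = 4.0;     /* memory-cycle / processor-cycle ratio      */
    double f  = 500e6;   /* clock rate in Hz (t = 1/f)                */

    double cpi  = p + m * k;        /* effective cycles per instruction */
    double T    = Ic * cpi / f;     /* CPU time in seconds              */
    double mips = f / (cpi * 1e6);  /* MIPS rate                        */
    double Wp   = mips * 1e6 / Ic;  /* CPU throughput in programs/s     */

    printf("CPI = %.2f  T = %.3f s  MIPS = %.1f  Wp = %.3f programs/s\n",
           cpi, T, mips, Wp);
    return 0;
}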
Exercise: calculate the number of instructions executed and the CPI for S1 and S2.
Programming Environments
Sequential environment
Parallel environment
Implicit Parallelism
Uses a conventional language to write the source program. The sequentially coded source program is translated into parallel object code by a parallelizing compiler.
This compiler must be able to detect parallelism and assign target machine resources. This compiler approach has been applied in programming shared-memory multiprocessors.
This approach requires less effort on the part of the programmer.
Explicit Parallelism
Requires more effort by the programmer to develop a source program using parallel dialects.
Parallelism is explicitly specified in the user programs. This reduces the burden on the compiler to detect parallelism.
The compiler needs to preserve the specified parallelism and, where possible, assign target machine resources.
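The sketch below contrasts the two approaches on one loop: the first loop is plain sequential code left to a parallelizing compiler (implicit), while the second states the parallelism in the source with an OpenMP directive (explicit). OpenMP is used here only as one example of a parallel dialect; the array and its size are assumptions.

/* Implicit vs. explicit parallelism on the same data.
 * Assumed build command: gcc -fopenmp sum.c */
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++)      /* implicit: left to a parallelizing compiler */
        a[i] = i * 0.5;

    #pragma omp parallel for reduction(+:sum)   /* explicit: parallelism stated by the programmer */
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f\n", sum);
    return 0;
}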
MULTIPROCESSORS AND MULTICOMPUTERS
1. Shared-Memory Multiprocessors
The three shared-memory multiprocessor models are the UMA, NUMA, and COMA models.
The UMA Model
The physical memory is uniformly shared by all the processors.
All processors have equal access time to all memory words, which is why it is called uniform memory access.
Each processor may use a private cache. Peripherals are also shared in some fashion.
Multiprocessors are called tightly coupled systems due to the high degree of resource sharing.
The system interconnect takes the form of a common bus, a crossbar switch, or a multistage network.
The UMA model is suitable for general-purpose and time-sharing applications by multiple users.
When all processors have equal access to all peripheral devices, the system is called a symmetric multiprocessor.
In an asymmetric multiprocessor, only one or a subset of the processors are executive-capable.
An executive or master processor can execute the operating system and handle I/O.
The remaining processors have no I/O capability and are therefore called attached processors (APs).
Attached processors execute user code under the supervision of the master processor.
The NUMA Model
A NUMA multiprocessor is a shared-memory system in which the access time varies with the location of the memory word.
The shared memory is physically distributed among the processors as local memories.
The collection of all local memories forms a global address space accessible by all processors.
It is faster for a processor to access its own local memory.
Access to remote memory attached to other processors takes longer due to the added delay through the interconnection network.
Globally shared memory can also be added to a multiprocessor system.
This gives three memory-access patterns: the fastest is local memory access, the next is global memory access, and the slowest is access to remote memory.
In a hierarchically structured (cluster) NUMA system, the processors are divided into several clusters. Each cluster is itself a UMA or a NUMA multiprocessor. The clusters are connected to global shared-memory modules. The entire system is considered a NUMA multiprocessor. All processors belonging to the same cluster are allowed to uniformly access the cluster shared-memory modules.
All clusters have equal access to the global memory. However, the access time to the cluster memory is shorter than that to the global memory.
The COMA Model
A multiprocessor using cache-only memory
The COMA model is a special case of a NUMA machine in which the distributed main memories are converted to caches.
There is no memory hierarchy at each processor node.
All the caches form a global address space.
Remote cache access is assisted by the distributed cache directories (D).
Depending on the interconnection network used, hierarchical directories may sometimes be used to help locate copies of cache blocks.
Initial data placement is not critical because data will eventually migrate to where it will be used.
Limitations
Multiprocessor systems are suitable for general-purpose multiuser applications where programmability is the major concern.
A major shortcoming of multiprocessors is the lack of scalability.
Latency tolerance for remote memory access is also a major limitation.
Distributed-Memory Multicomputers
The system consists of multiple computers, often called nodes, interconnected by a message-passing network.
Each node is an autonomous computer consisting of a processor, local memory, and sometimes attached disks or I/O peripherals.
Because no node can directly access the memory of another node, such systems are called no-remote-memory-access (NORMA) machines.
Internode communication is carried out by passing messages through the static connection network.
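A minimal sketch of internode message passing is given below, written with MPI as one common message-passing interface; the node ranks and the value sent are illustrative assumptions.

/* Node 0 sends a value to node 1 over the interconnection network;
 * no memory is shared between the nodes. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);          /* to node 1 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                                 /* from node 0 */
        printf("node 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}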
Multicomputer Generations
Modern multicomputers use hardware routers to pass messages.
A computer node is attached to each router. The boundary routers may be connected to I/O and peripheral devices.
Message passing between any two nodes involves a sequence of routers and channels.
Heterogeneous multicomputer – internode communication in a heterogeneous multicomputer is achieved through compatible data representations and message-passing protocols.
Commonly used static topologies for constructing multicomputers include the ring, tree, mesh, torus, hypercube, cube-connected cycles, etc.
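For the binary hypercube in particular, two nodes are directly connected exactly when their binary addresses differ in one bit, so a node's neighbors are found by flipping each address bit in turn; the 3-cube size and node number below are assumptions for illustration.

/* Neighbors of a node in an n-dimensional hypercube (binary n-cube). */
#include <stdio.h>

int main(void)
{
    int n = 3;        /* dimension: 2^3 = 8 nodes */
    int node = 5;     /* address 101 in binary    */

    printf("neighbors of node %d:", node);
    for (int d = 0; d < n; d++)
        printf(" %d", node ^ (1 << d));   /* flip bit d */
    printf("\n");                         /* prints: 4 7 1 */
    return 0;
}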
Taxonomy of MIMD Computers
MULTIVECTOR AND SIMD COMPUTERS
Classification of supercomputers
Pipelined vector machines
SIMD computers
Vector Supercomputers
A vector computer is often built on top of a scalar processor.
The vector processor is attached to the scalar processor as an optional feature.
Program and data are first loaded into the main memory through a host computer.
All instructions are first decoded by the scalar control unit.
If the decoded instruction is a scalar operation or a program control operation, it is executed directly by the scalar processor using the scalar functional pipelines.
If the instruction is decoded as a vector operation, it is sent to the vector control unit.
This control unit supervises the flow of vector data between the main memory and the vector functional pipelines.
The vector data flow is coordinated by the control unit.
A number of vector functional pipelines may be built into a vector processor.
Vector Processor Models
Register-to-register architecture
Vector registers are used to hold the vector operands and the intermediate and final vector results.
The vector functional pipelines retrieve operands from and put results into the vector registers.
Memory-to-memory architecture
Differs from a register-to-register architecture in the use of a vector stream unit in place of the vector registers.
Vector operands and results are directly retrieved from and stored into the main memory in superwords.
SIMD Supercomputers
SIMD Machine Model
An operational model of an SIMD computer is specified by the 5-tuple
M = (N, C, I, M, R)
where N is the number of processing elements (PEs); C is the set of instructions directly executed by the control unit, including scalar and program flow control instructions; I is the set of instructions broadcast by the control unit to all PEs for parallel execution; M is the set of masking schemes, each of which partitions the set of PEs into enabled and disabled subsets; and R is the set of data-routing functions specifying the interconnection patterns used for inter-PE communication.
CONDITIONS OF PARALLELISM
Data Dependence
The ordering relationship between statements is indicated by data dependence.
Five types of data dependence are defined below:
(1) Flow dependence: A statement S2 is flow-dependent on statement S1 if an execution path exists from S1 to S2 and if at least one output of S1 feeds in as input to S2. Flow dependence is denoted S1 → S2.
(2) Antidependence: Statement S2 is antidependent on statement S1 if S2 follows S1 in program order and if the output of S2 overlaps the input to S1. Antidependence from S1 to S2 is denoted by a direct arrow crossed with a bar.
(3) Output dependence: Two statements are output-dependent if they produce (write) the same output variable. Output dependence from S1 to S2 is denoted by an arrow marked with a small circle.
(4) I/O dependence: Read and write are I/O statements. I/O dependence occurs not because the same variable is involved but because the same file is referenced by both I/O statements.
(5) Unknown dependence: The dependence relation between two statements cannot be determined in the following situations:
The subscript of a variable is itself subscripted (indirect addressing).
The subscript does not contain the loop index variable.
A variable appears more than once with subscripts having different coefficients of the loop variable (i.e., different functions of the loop variable).
The subscript is nonlinear in the loop index variable.
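The C fragment below illustrates the first three dependence types; the statements and variable names are invented purely to show the read/write patterns.

/* Read/write patterns that produce the dependence types defined above. */
void dependence_demo(void)
{
    int a, b = 1, c = 2, d;

    a = b + c;    /* S1: writes a, reads b and c                                  */
    d = a * 2;    /* S2: flow dependence S1 -> S2 (S2 reads the a written by S1)  */
    b = d - 1;    /* S3: antidependent on S1 (S3 writes b, which S1 reads);
                     also flow-dependent on S2 through d                          */
    d = c + 5;    /* S4: output dependence with S2 (both write d);
                     antidependent on S3 (S3 reads d before S4 overwrites it)     */
    (void)a; (void)b; (void)d;
}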
Control Dependence
This refers to the situation where the order of execution of statements cannot be determined before run time.
The successive iterations of the following loop are control-independent:
      Do 20 I = 1, N
        A(I) = C(I)
        IF (A(I) .LT. 0) A(I) = 1
20    Continue
The following loop has control-dependent iterations:
      Do 10 I = 1, N
        IF (A(I-1) .EQ. 0) A(I) = 0
10    Continue
Resource Dependence
Resource dependence is concerned with conflicts in using shared resources.
When the conflicting resource is an ALU, we call it ALU dependence.
If the conflicts involve workplace storage, we call it storage dependence.
In the case of storage dependence, each task must work on independent storage locations or use protected access to shared writable data.
Bernstein's Conditions
In 1966, Bernstein revealed a set of conditions under which two processes can execute in parallel.
A process is a software entity corresponding to the abstraction of a program fragment defined at various processing levels.
The input set Ii of a process Pi is defined as the set of all input variables needed to execute the process (fetched from memory or registers).
The output set Oi consists of all output variables generated after execution of the process Pi (results).
Consider two processes P1 and P2 with their input sets I1 and I2 and output sets O1 and O2, respectively.
These two processes can execute in parallel (denoted P1 || P2) if they are independent and therefore produce deterministic results, that is, if Bernstein's conditions hold:
I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, and O1 ∩ O2 = ∅.
Bernstein's Conditions
In general, the || relation is:
commutative: Pi || Pj implies Pj || Pi;
not transitive: Pi || Pj and Pj || Pk do not necessarily imply Pi || Pk;
associative: Pi || Pj || Pk implies (Pi || Pj) || Pk = Pi || (Pj || Pk).
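A small sketch of testing Bernstein's conditions for two processes is given below; the input and output sets are encoded as bit masks, and the example processes and variable encoding are assumptions made only for illustration.

/* P1 || P2 iff I1 ∩ O2, I2 ∩ O1 and O1 ∩ O2 are all empty.
 * Each variable is one bit of an unsigned int. */
#include <stdio.h>

enum { A = 1 << 0, B = 1 << 1, C = 1 << 2, D = 1 << 3, E = 1 << 4 };

static int parallel(unsigned I1, unsigned O1, unsigned I2, unsigned O2)
{
    return (I1 & O2) == 0 && (I2 & O1) == 0 && (O1 & O2) == 0;
}

int main(void)
{
    /* P1: A = B + C  ->  I1 = {B,C}, O1 = {A}
       P2: D = E + 1  ->  I2 = {E},   O2 = {D}   (independent)         */
    printf("P1 || P2: %s\n", parallel(B | C, A, E, D) ? "yes" : "no");

    /* P1: A = B + C  ->  I1 = {B,C}, O1 = {A}
       P3: C = A * 2  ->  I3 = {A},   O3 = {C}   (conditions violated) */
    printf("P1 || P3: %s\n", parallel(B | C, A, A, C) ? "yes" : "no");
    return 0;
}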
Example
Fig. 2.2 Detection of parallelism in the program of Example 2.2
Hardware and Software Parallelism
Modern computers require special hardware and software support for parallelism.
Hardware Parallelism
This is the type of parallelism defined by the machine architecture and hardware multiplicity.
Hardware parallelism is often a function of cost and performance tradeoffs.
It displays the resource utilization patterns of simultaneously executable operations.
It can also indicate the peak performance of the processor resources.
Hardware Parallelism
One way to characterize the parallelism in a processor is by the number of instruction issues per machine cycle. If a processor issues k instructions per machine cycle, it is called a k-issue processor.
A conventional pipelined processor takes one machine cycle to issue a single instruction. These processors are called one-issue machines, with a single instruction pipeline in the processor.
In a modern processor, two or more instructions can be issued per machine cycle.
Software Parallelism
This type of parallelism is revealed in the program flow graph.
Software parallelism is a function of algorithm, programming style, and program design.
Two types of software parallelism: control parallelism and data parallelism.
The first type is control parallelism, which allows two or more operations to be performed simultaneously. Control parallelism, appearing in the form of pipelining or multiple functional units, is limited by the pipeline length and by the multiplicity of functional units.
The second type is data parallelism, in which almost the same operation is performed over many data elements by many processors simultaneously.
Assignment 1
Use Bernstein's conditions to detect the maximum parallelism in the given code:
S1: A = B + C
S2: C = D + E
S3: F = G + E
S4: C = A + F
S5: M = G + C
S6: A = L + C
S7: A = E + A