
17CS72 Mod 1 PPT




Advanced Computer Architecture

17CS72

MODULE-1

Theory of Parallelism

Book: “Advanced Computer Architecture – Parallelism, Scalability, Programmability”, Hwang & Jotwani
MODULE-1

Syllabus (MODULE-1: Theory of Parallelism):
• Parallel Computer Models: The State of Computing, Multiprocessors and Multicomputers
• Multivector and SIMD Computers
• PRAM and VLSI Models, Program and Network Properties
• Conditions of Parallelism
• Program Partitioning and Scheduling
• Program Flow Mechanisms, System Interconnect Architectures
• Principles of Scalable Performance
• Performance Metrics and Measures, Parallel Processing Applications
• Speedup Performance Laws, Scalability Analysis and Approaches
Theory of Parallelism

THE STATE OF COMPUTING: Computer Development Milestones
• How it all started…
 500 BC: Abacus (China) – the earliest mechanical computing/calculating device;
operated to perform decimal arithmetic with carry propagation digit by digit
 1642: Mechanical Adder/Subtractor (Blaise Pascal)
 1827: Difference Engine (Charles Babbage)
 1941: First binary mechanical computer (Konrad Zuse; Germany)
 1944: Harvard Mark I (IBM)
The very first electromechanical decimal computer as proposed by Howard Aiken
• Computer Generations
 1st, 2nd, 3rd, 4th, 5th
 Division into generations marked primarily by changes in
hardware and software technologies

THE STATE OF COMPUTING: Computer Generations

• First Generation (1945 – 54)


o Technology & Architecture:
• Vacuum Tubes
• Relay Memories (switch-type latches)
• CPU driven by PC and accumulator
• Fixed Point Arithmetic
o Software and Applications:
• Machine/Assembly Languages
• Single user
• No subroutine linkage
• Programmed I/O using CPU
o Representative Systems: ENIAC, Princeton IAS, IBM 701

THE STATE OF COMPUTING: Computer Generations

• Second Generation (1955 – 64)


o Technology & Architecture:
• Discrete Transistors (Single Transistors)
• Core Memories
• Floating Point Arithmetic
• I/O Processors
• Multiplexed memory access
o Software and Applications:
• High level languages used with compilers
• Subroutine libraries
• Batch Processing Monitor
o Representative Systems: IBM 7090, CDC 1604, Univac LARC

THE STATE OF COMPUTING: Computer Generations

• Third Generation (1965 – 74)


o Technology & Architecture:
• IC Chips (SSI/MSI)
• Microprogramming
• Pipelining
• Cache
• Look-ahead processors
o Software and Applications:
• Multiprogramming and Timesharing OS
• Multiuser applications
o Representative Systems: IBM 360/370, CDC 6600, TI-ASC, PDP-8

THE STATE OF COMPUTING: Computer Generations

• Fourth Generation (1975 – 90)


o Technology & Architecture:
• LSI/VLSI
• Semiconductor memories
• Multiprocessors
• Multi-computers
• Vector supercomputers
o Software and Applications:
• Multiprocessor OS
• Languages, Compilers and environment for parallel
processing
o Representative Systems: VAX 9000, Cray X-MP, IBM 3090, BBN TC-2000

THE STATE OF COMPUTING: Computer Generations

• Fifth Generation (1991 onwards)


o Technology & Architecture:
• Advanced VLSI Processors
• Memory and Switches
• Scalable Architectures
• High Density Packaging
o Software and Applications:
• Superscalar Processors
• Systems on a chip
• Massively parallel processing
• Grand challenge applications
• Heterogeneous processing
o Representative Systems: S-81, IBM ES/9000, Intel Paragon, nCUBE
6480, MPP, VPP500
THE STATE OF COMPUTING: Elements of Modern Computers

• The hardware, software, and programming elements of modern computer


systems can be characterized by looking at a variety of factors, including:
– Computing problems
– Algorithms and data structures
– Hardware resources
– Operating systems
– System software support
– Compiler support

THE STATE OF COMPUTING: Elements of Modern Computers
Computing problems

• A modern computer is an integrated system consisting of :


– Machine hardware
– An instruction set
– System software
– Application programs
– User interfaces.
• The use of a computer is driven by real-life problems demanding cost-effective
solutions.
• For numerical problems in science and technology, the solutions demand
complex mathematical formulations and intensive integer or floating-point
computations.
• For artificial intelligence (AI) problems, the solutions demand logic inferences
and symbolic manipulations.

THE STATE OF COMPUTING: Elements of Modern Computers
Algorithms and Data Structures

• Traditional algorithms and data structures are designed


– For sequential machines.
• New, specialized algorithms and data structures are needed
– To exploit the capabilities of parallel architectures.
• These often require interdisciplinary interactions
– Among theoreticians, experimentalists and programmers.

THE STATE OF COMPUTING: Elements of Modern Computers
Hardware Resources

• Processors, memory, and peripheral devices form the hardware core of a


computer system.
• A modern computer system demonstrates its power through coordinated
efforts by :
– Hardware resources
– An operating system
– Application software.
• Special hardware interfaces are often built into I/O devices such as display
terminals and workstations.
• In addition, software interface programs are needed. These software
interfaces include :
– File transfer systems, editors, word processors, device drivers, interrupt handlers,
network communication programs, etc.

THE STATE OF COMPUTING: Elements of Modern Computers
Operating System

• Operating systems manage the allocation and deallocation of resources during user
program execution.
• An OS plays a significant role in mapping hardware resources to algorithmic and data
structures.
– Implementation of Mapping:
• Relies on efficient Compiler and Operating system support.
• Parallelism can be exploited at:
– Algorithm design time
– Programming time
– Compile time
– Run time

THE STATE OF COMPUTING: Elements of Modern Computers
System Software Support

• System software is needed for the development of efficient programs in HLL.


• Some examples of system software are: compilers, loaders, linkers, and the OS kernel
• Parallel software can be developed:
– Using entirely new languages designed specifically with parallel support
– Using extensions to existing sequential languages.
• New languages have obvious advantages:
– Like new constructs specifically for parallelism
– But require additional programmer education and system software.
• The most common approach is to extend an existing language.

THE STATE OF COMPUTING: Elements of Modern Computers
Compiler Support

There are three compiler upgrade approaches:


1. Pre-processors
– Use existing sequential compilers and specialized libraries to implement parallel
constructs
2. Pre-compilers
– Perform some program flow analysis, dependence checking, and limited parallel
optimizations
3. Parallelizing Compilers
– Requires full detection of parallelism in source code, and transformation of
sequential code into parallel constructs
• Compiler directives are often inserted into source code to aid compiler
parallelizing efforts

THE STATE OF COMPUTING: Evolution of Computer Architecture

• The study of computer architecture involves both the following:


o Hardware organization
o Programming/software requirements
• The evolution of computer architecture is believed to have started with von
Neumann architecture
o Built as a sequential machine
o Executing scalar data (Single value)
• Major leaps in this context came as…
o Look-ahead, parallelism and pipelining
o Flynn’s classification
o Parallel/Vector Computers
o Development Layers

THE STATE OF COMPUTING: Evolution of Computer Architecture
Look-ahead, parallelism and pipelining

• Look-ahead
– Introduced to prefetch instructions so that instruction fetch and execute can overlap
– Enables functional parallelism
• Parallelism
– Supported by two approaches:
• Use of multiple functional units simultaneously
• Pipelining at various processing levels, which includes:
– Pipelined instruction execution
– Pipelined arithmetic computations
– Memory-access operations

• Pipelining
– Performing identical operations repeatedly over vector data strings
– Carried out using scalar pipeline processors

THE STATE OF COMPUTING: Evolution of Computer Architecture
Flynn’s classification
• This taxonomy distinguishes multi-processor computer architectures according to two
independent dimensions :
• Instruction stream
– Sequence of instructions executed by machine
• Data stream.
– Sequence of data including input, partial or temporary results used by instruction stream
• Each of these dimensions can have only one of two possible states:
– Single or Multiple
• Flynn’s classification depends on the distinction between:
– The performance of the control unit and the data-processing unit, rather than the
machine's operational and structural interconnections.
• The four categories of Flynn’s classification are:
– SISD(Single instruction stream, single data stream)
– SIMD(Single instruction stream, multiple data stream)
– MIMD(Multiple instruction stream, multiple data stream)
– MISD(Multiple instruction stream, single data stream)
THE STATE OF COMPUTING: Evolution of Computer Architecture

Flynn’s classification SISD(Single instruction stream, single data stream)

Flynn’s classification SIMD(Single instruction stream, multiple data stream)

THE STATE OF COMPUTING: Evolution of Computer Architecture

Flynn’s classification MIMD(Multiple instruction stream, multiple data stream)

Flynn’s classification MISD(Multiple instruction stream, single data stream)

THE STATE OF COMPUTING: Evolution of Computer Architecture
Parallel/Vector Computers
Parallel computers are those that execute programs in MIMD mode.
• There are two major classes of parallel computers
– Shared-memory multiprocessors
– message-passing multicomputers.
• The major distinction between multiprocessors and multicomputers lies in:
– Memory sharing
• Multiprocessors: processors communicate through shared variables in a common memory
• Multicomputers: each computer has a local memory unshared with other computers
– Inter-processor communication
• Achieved through message passing in a multicomputer

Vector Processors:
• Equipped with multiple vector pipelines that can be concurrently used under
hardware or firmware control
• Two families :
– Memory-to-memory architecture
– Register-to-register architecture

THE STATE OF COMPUTING: Evolution of Computer Architecture
Development Layers

Six layers of computer system development:

• From the programmer's view point, the top two layers should be architecture-transparent.
• The features of the lower layers are up to the designer and should match the target
application domain.

As a good computer architect, one has to approach the problem from both ends.
THE STATE OF COMPUTING: System Attributes Affecting Performance

• The ideal performance of a computer system demands a perfect match between


machine capability and program behavior.
• Performance depends on:
– Hardware technology
– Architectural features
– Efficient resource management
– Algorithm design
– Data structures
– Language efficiency
– Programmer skill
– Compiler technology

THE STATE OF COMPUTING: System Attributes Affecting Performance

Performance Indicators:
• The simplest measure of program performance is the turnaround time , which
includes :
– Disk and memory accesses
– Input and output
– Compilation time
– Operating system overhead
– CPU time
• Since I/O and system overhead frequently overlaps processing by other programs,
it is fair to consider only :
– The CPU time used by a program.
– The user CPU time is the most important factor.

THE STATE OF COMPUTING: System Attributes Affecting Performance

Clock Rate and CPI:


• CPU is driven by a clock with a constant cycle time τ (usually measured in
nanoseconds).
• The inverse of the cycle time is the clock rate (f = 1/ τ , measured in megahertz).
• The size of a program is determined by its instruction count, Ic , the number of
machine instructions to be executed by the program.
• Different machine instructions require different numbers of clock cycles to execute.
CPI (cycles per instruction) is thus an important parameter.
Average CPI:
• It is easy to determine the average number of cycles per instruction for a particular
processor if we know the frequency of occurrence of each instruction type.
• Of course, any estimate is valid only for a specific set of programs (which defines the
instruction mix), and then only if there is a sufficiently large number of instructions.
• In general, the term CPI is used with respect to a particular instruction set and a given
program mix.
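
As a quick illustration, here is a minimal Python sketch (the instruction mix and cycle counts are hypothetical, not taken from the text) of how an average CPI is obtained as a frequency-weighted sum over instruction types:

# Hypothetical instruction mix: type -> (fraction of executed instructions, cycles)
instruction_mix = {
    "ALU":    (0.50, 1),
    "load":   (0.20, 2),
    "store":  (0.10, 2),
    "branch": (0.20, 3),
}

# Average CPI = sum of (frequency x cycles) over all instruction types
average_cpi = sum(freq * cycles for freq, cycles in instruction_mix.values())
print(f"Average CPI = {average_cpi:.2f}")  # 1.70 for this particular mix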
THE STATE OF COMPUTING: System Attributes Affecting Performance

Performance Factors (1):


• The time required to execute a program containing Ic instructions is just
T = Ic * CPI * τ.
• Each instruction must be: fetched from memory → decoded → operands fetched from
memory → executed → results stored.
• The time required to access memory is called the memory cycle time,
– Which is usually k times the processor cycle time τ.
– The value of k depends on the memory technology and the processor-memory
interconnection scheme.
Performance Factors (2):
• The processor cycles required for each instruction (CPI)is the total of
– cycles needed for instruction decode & execution (p)
– cycles needed for memory references (m * k)
• The total time needed to execute a program can then be rewritten as
T = Ic × (p + m × k) × τ
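
The following minimal Python sketch (with made-up values for Ic, p, m, k, and τ, chosen purely for illustration) evaluates this refined execution-time formula:

# Hypothetical performance factors
Ic  = 200_000      # instruction count
p   = 1.2          # average processor cycles for decode & execute per instruction
m   = 0.4          # average memory references per instruction
k   = 4            # memory cycle time as a multiple of the processor cycle time
tau = 2e-9         # processor cycle time in seconds (2 ns, i.e. f = 500 MHz)

T = Ic * (p + m * k) * tau   # total execution time in seconds
print(f"T = {T * 1e3:.3f} ms, effective CPI = {p + m * k:.2f}")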
THE STATE OF COMPUTING: System Attributes Affecting Performance

System Attributes:

The five performance factors (Ic , p, m, k, τ) are influenced by four system


attributes:

• Instruction-set architecture (affects Ic and p)


• Compiler technology (affects Ic and p and m)
• CPU implementation and control (affects p × τ)
• Cache and memory hierarchy (affects memory access latency, k × τ)

Total CPU time can be used as a basis in estimating the execution rate of a
processor.

THE STATE OF COMPUTING: System Attributes Affecting Performance

MIPS Rate:
If C is the total number of clock cycles needed to execute a given
program =>
• Then total CPU time can be estimated as:
– T = C × τ = C / f.
• Other relationships are easily observed:
– CPI = C / Ic
– T = Ic × CPI × τ
– T = Ic × CPI / f
Processor speed is often measured in terms of millions of
instructions per second, frequently called the MIPS rate of the
processor
THE STATE OF COMPUTING: System Attributes Affecting Performance

MIPS Rate:

The MIPS rate is:


 Directly proportional to the clock rate
 Inversely proportional to the CPI.

Note:
All four system attributes (instruction set, compiler, processor, and memory
technologies) affect the MIPS rate, which varies also from program to program.
THE STATE OF COMPUTING: System Attributes Affecting Performance

Throughput Rate:
The system throughput Ws is the number of programs a system can execute
per unit time, measured in programs per second.
CPU throughput, Wp, is defined as:
Wp = f / (Ic × CPI)
In a multiprogrammed system, the system throughput is often
less than the CPU throughput.
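
A minimal sketch (again with hypothetical numbers) relating clock rate, CPI, MIPS rate, and CPU throughput; the MIPS expression used here is the standard one, MIPS = f / (CPI × 10^6):

# Hypothetical processor and program parameters
f   = 40e6        # clock rate: 40 MHz
CPI = 1.55        # average cycles per instruction
Ic  = 100_000     # instructions in the program

mips_rate = f / (CPI * 1e6)        # millions of instructions per second
T         = Ic * CPI / f           # CPU time for this program (seconds)
Wp        = f / (Ic * CPI)         # CPU throughput in programs per second

print(f"MIPS = {mips_rate:.2f}, T = {T*1e3:.3f} ms, Wp = {Wp:.1f} programs/s")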

THE STATE OF COMPUTING: System Attributes Affecting Performance

Programming Enviroments:
• Programmability depends on the :
– Programming environment provided to the users.
• Conventional computers are used in a sequential programming
environment
– With tools developed for a uniprocessor computer.
• Parallel computers need parallel tools:
– That allow specification or easy detection of parallelism
– Operating systems that can perform parallel scheduling of:
• Concurrent events
• Shared memory allocation
• Shared peripheral
• Communication links.

THE STATE OF COMPUTING: System Attributes Affecting Performance
Approach of Parallelism

Source program → sequentially coded

• Compiler has to detect parallelism and
assign machine resources
• Used in shared-memory
multiprocessors
• Success relies on the intelligence
of the parallelizing compiler

THE STATE OF COMPUTING: System Attributes Affecting Performance
Approach of Parallelism

Source program → parallelism is
explicitly specified in the user program

• Compiler needs to preserve the
specified parallelism and assign machine
resources

Multiprocessors and Multicomputers
Categories of Parallel Computers:
• Considering their architecture only, there are two main categories of
parallel computers:
– Systems with shared common memories(Shared Memory Multiprocessors)
• The UMA Model
• The NUMA Model
• The COMA Model
• The CC-NUMA Model
– Systems with unshared distributed memories(Distributed-Memory
Multicomputers)
• The NORMA Machines
• Message Passing multicomputers
– Taxonomy of MIMD Computers
– Representative Systems
• Multiprocessors: BBN TC-2000, MPP, S-81, IBM ES/9000 Model 900/VF
• Multicomputers: Intel Paragon XP/S, nCUBE/2 6480, SuperNode 1000, CM-5, KSR-1
Multiprocessors and Multicomputers:
Shared Memory Multiprocessors( Uniform Memory Access Model UMA Model)
• Physical memory uniformly shared by
all processors
– Has equal access time to all words.
• Processors may have local cache
memories.
• Peripherals also shared in some
fashion.
• Tightly coupled systems use a :
– common bus, crossbar, or multistage
network to connect processors,
peripherals, and memories.
• Many manufacturers have
multiprocessor (MP) extensions of
uniprocessor (UP) product lines.
Multiprocessors and Multicomputers:
Shared Memory Multiprocessors(UMA Model)

• Synchronization and communication among


processors achieved through shared variables in
common memory.

• Two types of Multiprocessors:

• Symmetric MP systems :
– All processors have access to all peripherals
– Any processor can run the OS and I/O device
drivers.
• Asymmetric MP systems :
– Not all peripherals accessible by all processors
– Kernel runs only on selected processors (master)
– Others are called attached processors (APs).
Multiprocessors and Multicomputers:
Shared Memory Multiprocessors(UMA Model)

Example: Performance Calculation:


Consider two loops:(Sequential Processor)
• The first loop adds corresponding elements of two N-
element vectors to yield a third vector.
• The second loop sums the elements of the third vector.
• Assume each add/assign operation
– Takes 1 cycle,
– And ignore time spent on other actions (e.g. loop counter
incrementing/testing, instruction fetch, etc.).
• Assume interprocessor communication
– Requires k cycles.
• On a sequential system, each loop(I/J) will require N
cycles
• Hence a total of 2N cycles of processor time is required

Multiprocessors and Multicomputers:
Shared Memory Multiprocessors(UMA Model)

Example: Performance Calculation::(Parallel Processor)


• To execute a program On an M-processor system,
• We can partition each loop into M parts,
– Each having L = N / M (for each Loop section).
• As we have 2 loops, the total time required is thus 2L = 2N / M cycles.
– This leaves us with M partial sums that must be totalled (through a binary-tree reduction).
• Computing the final sum from the M partial sums requires:
– l = log2(M) addition steps to merge all partial sums up an l-level binary tree
– k cycles per step for communicating a partial sum through shared memory
– 1 cycle per step for the add/assign
– Hence the reduction takes a total of l × (k + 1) cycles.
• The parallel computation thus requires
2N / M + (k + 1) log2(M) cycles
Multiprocessors and Multicomputers:
Shared Memory Multiprocessors(UMA Model)

Example: Performance Calculation:


• Assume N = 2^20.
• Sequential execution requires 2N = 2^21 cycles.
• If processor synchronization requires k = 200 cycles, and we have M = 256
processors, parallel execution requires
2N / M + (k + 1) log2(M)
= 2^21 / 2^8 + 201 × 8
= 2^13 + 1608 = 9800 cycles
• Comparing results, the parallel solution is 214 times faster than the
sequential, with the best theoretical speedup being 256 (since there are
256 processors).
• Thus the efficiency of the parallel solution is 214 / 256 = 83.6 %
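
A small Python sketch that reproduces this calculation (same N, M, and k as in the example above), so the speedup and efficiency figures can be checked directly:

import math

N, M, k = 2**20, 256, 200          # problem size, processors, synchronization cost

sequential_cycles = 2 * N                                        # two N-iteration loops
parallel_cycles   = 2 * N // M + (k + 1) * int(math.log2(M))     # partitioned loops + tree reduction

speedup    = sequential_cycles / parallel_cycles
efficiency = speedup / M

print(f"parallel cycles = {parallel_cycles}")                     # 9800
print(f"speedup = {speedup:.0f}, efficiency = {efficiency:.1%}")  # ~214, ~83.6%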
Multiprocessors and Multicomputers:
Shared Memory Multiprocessors( Non-Uniform Memory Access( NUMA ) Model )

• Shared memories, but access time depends on the


location of the data item.
• The shared memory is physically distributed
among the processors as local memories, but
each of these is still accessible by all processors
(with varying access times).
• Memory access is fastest from the
locally-connected processor, with the
interconnection network adding delays for other
processor accesses.
• Additionally, there may be global memory in a
multiprocessor system:
– With two separate interconnection networks,
one for clusters of processors and their cluster
memories, and another for the global shared
memories.
Multiprocessors and Multicomputers:
Shared Memory Multiprocessors(NUMA Model:Hierarchical Cluster Model)

Multiprocessors and Multicomputers:
Shared Memory Multiprocessors(Cache Only Memory Access Model -COMA)

• In the COMA model,


– Special kind of NUMA model
– Distributed memories of NUMA have
converted to cache memories;
– The caches, taken together, form a
global address space.
• Each cache has an associated
directory
– That assists remote machines in their lookups;
– Hierarchical directories may exist in
machines based on this model.
• Initial data placement is not critical, as
cache blocks will eventually migrate to
where they are needed

Multiprocessors and Multicomputers:
Shared Memory Multiprocessors
(Cache Coherent –Non Uniform Memory Access CC-NUMA)

It uses the concept of both:


1. COMA
2. NUMA

Multiprocessors and Multicomputers:
Distributed Memory Multicomputers
• A multicomputer consists of multiple
computers/nodes, interconnected by a
message-passing network.
• Each node is autonomous, with its own


processor and local memory, and
sometimes local peripherals.
• The message-passing network provides
point-to-point static connections among
the nodes.
• Local memories are not shared,
– So traditional multicomputers are called
no-remote-memory-access (or NORMA)
machines.
• Inter-node communication is achieved
by passing messages through the static
connection network.
Multiprocessors and Multicomputers:
Multicomputer Generations
• Each multicomputer uses routers and channels in its interconnection network.
• Heterogeneous systems may involve mixed node types, which requires uniform
data representation and communication protocols.
• First generation:
– Hypercube architecture
– Software controlled message switching
– Processor boards
• Second generation:
– Mesh-connected architecture
– Hardware message switching
– Software for medium-grain distributed computing
• Third generation:
– Fine-grained distributed computing, with each VLSI chip containing the processor and
communication resources
Multiprocessors and Multicomputer
Taxonomy of MIMD Computers
Parallel Computers
Parallel computers exist in either SIMD or MIMD configurations:
 SIMD configuration
 For special-purpose applications
 e.g. CM-2 (Connection Machine) on SIMD architecture
 MIMD configuration
 e.g. CM-5 on MIMD architecture
 MIMD machines having a globally shared virtual address space are divided into:
 Scalable multiprocessors or multicomputers
 Use distributed shared memory
 Un-scalable multiprocessors
 Use centrally shared memory

Taxonomy of MIMD Computers

(Figure: taxonomy of MIMD computers – machines with a single address space and shared-memory computation, divided into scalable and not-scalable designs.)
Multivector and SIMD Computers

Supercomputers (having the highest operational rates) are classified as:

 Pipelined Vector machine/ Vector Supercomputers:


 Using a few powerful processors equipped with vector hardware
 Vector Processing

 SIMD Computers / Parallel Processors:


 Emphasizing massive data parallelism

Multivector and SIMD Computers
Vector Supercomputers
A vector computer is built on top of a scalar processor.

Multivector and SIMD Computers
Vector Supercomputers
Step 1-2: Program & data are first loaded into the Main Memory through a Host
computer.

Step 3: All instructions are first decoded by the Scalar Control Unit.

Step 4: If the decoded instruction is a scalar operation or a program control operation, it


will be directly executed by the scalar processor using the Scalar Functional Pipelines.

Step 5: If the instructions are decoded as a Vector operation, it will be sent to the vector
control unit.

Step 6: Vector control unit will supervise the flow of vector data between the main
memory and vector functional pipelines.

Note: A number of vector functional pipelines may be built into a Vector Processor.
Multivector and SIMD Computers
Vector Supercomputers - 2 Models
Model 1:Register-to-register architecture

– A fixed number of possibly reconfigurable registers

• Which are used to hold all vector operands, intermediate, and final vector
results.

– All registers are accessible in user instructions

– All vector register lengths will be fixed (e.g. 64 bits)

Model 2: Memory-to-memory architecture

– Primary memory holds operands and results

– A vector stream unit accesses memory for operand fetches

– Stores into main memory are done in large units, called superwords (e.g. 512 bits)
Multivector and SIMD Computers
SIMD Supercomputers
Single instruction stream operating over multiple data streams

Multivector and SIMD Computers
SIMD Supercomputers
An operational model of an SIMD computer is specified by a 5-tuple: (N, C, I, M, R)

(1) N = No. of Processing Elements (PE) in the machine.

(2) C =Set of instructions directly executed by the control unit (CU).


• Scalar & Program Flow Control Instructions.

(3) I = Set of instructions broadcast by the CU to all PEs for parallel execution.
• Include: Arithmetic, logic, data routing, masking,

• And other local operations executed by each active PE over data within that PE.

(4) M = Set of Masking Schemes


• Each mask partitions the set of PEs into enabled and disabled subsets for a given cycle.

(5) R = Set of data-routing functions


• Specifying various patterns to be set up in the interconnection network for inter-PE communications.
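
As an illustration, a minimal Python sketch of this 5-tuple in action (the broadcast instruction, masking scheme, and routing function below are hypothetical, chosen only to show how the CU broadcasts one instruction to the enabled PEs and then routes data between them):

N = 4                                   # number of processing elements
pe_data = [10, 20, 30, 40]              # one local operand per PE
mask    = [True, True, False, True]     # masking scheme: PE2 disabled this cycle

# I: instruction broadcast by the CU, executed only by enabled PEs
def broadcast_add(value):
    for i in range(N):
        if mask[i]:
            pe_data[i] += value

# R: a data-routing function -- circular shift of PE data by one position
def route_shift():
    return [pe_data[(i - 1) % N] for i in range(N)]

broadcast_add(5)          # CU broadcasts "add 5"; PE2 ignores it (masked off)
print(pe_data)            # [15, 25, 30, 45]
print(route_shift())      # [45, 15, 25, 30] after inter-PE routing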
PRAM and VLSI Models
Parallel Random Access Machines :
o Time and Space Complexities
• Time complexity
» It is a function of problem size
» Asymptotic Time complexity of an algorithm like Big-O etc.
» Best case  Worst case
• Space complexity
» It is a function of problem size S
» Asymptotic Space complexity refers to data storage of large problems
» Program and input data storage is not considered.
• Serial and parallel complexity
» The time complexity of the algorithm on a serial vs. a parallel machine
» The parallel complexity should be lower than the serial complexity.
• Deterministic and non-deterministic algorithms
» In a deterministic algorithm, every operational step is uniquely defined.
» In a non-deterministic algorithm, an operation may result in one outcome out of a set of possible outcomes.
PRAM and VLSI Models
Parallel Random Access Machines :
 Developed by Fortune and Wyllie (1978)

 Objective:

» Modeling idealized parallel computers with zero synchronization or memory


access overhead

 An n-processor PRAM has a globally addressable Memory

Each Processor can access any


memory location in unit time

PRAM and VLSI Models
Parallel Random Access Machines :
o NP Completeness:
» An algorithm has polynomial complexity if there exists a polynomial p(s) such that its
time complexity is O(p(s)) for a problem of size s.

» Set of problems having polynomial complexity algorithm by deterministic way is


called P-class (for polynomial class)

» Set of problems solvable by nondeterministic algorithms in polynomial time is


called NP-class (for nondeterministic polynomial class)

» P ⊂ NP

» P-class problems are computationally tractable, whereas NP-class problems may be intractable

» Note: whether P = NP or P ≠ NP is still an open problem

PRAM and VLSI Models
PRAM Models
Conventional Uniprocessor computers
• Modelled as random access machines (RAM)
Parallel random access machines
• Modelled to idealize parallel computers with zero synchronization or memory-access
overhead.
• Used for parallel algorithm development, scalability and complexity analysis.
• The globally addressable shared memory may be centralized or distributed among the n
processing elements (PEs).
• PEs operate on a synchronized read-memory, compute, write-memory cycle.
Four memory-update options in PRAMs:
Exclusive Read (ER): at most one processor may read a given memory location in a cycle
Exclusive Write (EW): at most one processor may write a given memory location in a cycle
Concurrent Read (CR): multiple processors may read the same memory location in the same cycle
Concurrent Write (CW): multiple processors may write to the same memory location simultaneously; the
write conflicts need to be resolved (avoided)
PRAM and VLSI Models
PRAM Variants – classified according to the four memory-update options:
1. EREW PRAM model:
 Prohibits more than one processor from reading or writing the same cell simultaneously.
2. CREW PRAM model:
 Allows concurrent reads of the same memory location.
 Write conflicts are avoided by mutual exclusion.
3. ERCW PRAM model:
 Allows exclusive reads and concurrent writes to the same memory location.
4. CRCW PRAM model:
 Allows both concurrent reads and concurrent writes to the same memory location.

PRAM and VLSI Models
Discrepancy with Physical Models of PRAM Variants:-

 Most popular variants: EREW and CRCW

 An SIMD machine with shared memory is the architecture most closely modelled by a PRAM.

 However, a PRAM allows different instructions to be executed on different processors simultaneously.

Thus, a PRAM really operates in a synchronized MIMD mode with shared memory.

PRAM and VLSI Models
VLSI Complexity Model:-
 Parallel computers rely on the use of VLSI chips
 To fabricate components such as processor arrays, memory arrays, switching networks,
etc.
 Nowadays, VLSI technologies are 2-dimensional.
 The size of a VLSI chip is proportional to the amount of storage (memory) space
available in that chip.
We deal with the AT² model (2-dimensional):
 We can measure the space complexity of an algorithm by:
 The chip area (A) of the VLSI chip implementation of that algorithm.
 If T is the time (latency) needed to execute the algorithm,
 Then A · T gives an upper bound on the total number of bits processed through the chip (or
I/O).
 For certain computations, there exists a lower bound f(s) such that
A · T² ≥ O(f(s))
where A = chip area and T = time

PRAM and VLSI Models
VLSI Complexity Model:-
Memory Bound on Chip Area:
 Many computations are memory-bound, as they need to process large data
sets.
 The memory requirement of a computation sets a lower bound on the chip area (A).
 The amount of information processed by the chip can be visualized as information flow
upward across the chip area.

Volume of the rectangular solid = product A · T

Note: as the information flows over the time T,

the number of input bits ≤ the volume A · T

PRAM and VLSI Models
VLSI Complexity Model:-
Bisection Communication Bound,

 The bisection is represented by a vertical slice cutting across the chip area A.

The height of the cross section is the time T.

The distance of the horizontal dimension is at most √A for a square chip.

The bisection area (√A · T) represents the
maximum amount of information
exchanged between the two halves of
the chip during the time period T.
Chapter-2
Program and Network Properties

In this chapter we will concentrate on the following topics:

Conditions of Parallelism
Three key areas for significant progress in parallel processing:
 Computational models for parallel computing

 Inter-processor communication in parallel architectures

 System integration for incorporating parallel system

Condition of parallelism Includes:


o Data dependence and resource dependence
o Hardware and software dependence
o The role of compiler

Conditions of Parallelism
Data dependence and resource dependence
The ability to execute several program segments in parallel:
• Requires each segment to be independent of the other segments.
• We use a dependence graph to describe the relations.
• The nodes of a dependence graph correspond to the program statement
(instructions)
• Directed edges with different labels are used to represent the ordered
relations among the statements.
• The analysis of dependence graphs shows where opportunity exists for
parallelization.

Conditions of Parallelism
Data dependence and resource dependence
Data Dependence : The ordering relationship between statements is
indicated by the data dependence.

Five Types:

1. Flow dependence

2. Antidependence

3. Output dependence

4. I/O dependence

5. Unknown dependence

Conditions of Parallelism
Data dependence and resource dependence
1. Flow dependence (RAW): a statement S2 is flow-dependent on S1 if S2 reads a value that S1 writes.

S1: ADD R1, R2, R3
S2: SUB R4, R1, R2    (S1 writes R1 → S2 reads R1, a read-after-write dependence)

2. Antidependence (WAR): a statement S2 is antidependent on statement S1 if S2 writes a value that S1 reads.

S1: ADD R1, R2, R3
S2: SUB R4, R1, R2
S3: ADD R1, R2, R3    (S2 reads R1 and S3 writes R1 → S3 is antidependent on S2,
                       i.e. S3 should be executed after S2)

(The small sketch after the five dependence types below shows how these are detected from
register read/write sets.)

Conditions of Parallelism
Data dependence and resource dependence
3. Output dependence (WAW): two statements are output dependent if they produce
(write) the same output variable. Also called a WAW hazard.

S1: ADD R1, R2, R3
S2: SUB R4, R1, R2
S3: ADD R1, R2, R3    (S3 is output-dependent on S1, since both S1 and S3 write R1)

4. I/O dependence : Read and write are I/O statements.


• I/O dependence occurs not because the same variable is involved but because the same file is
referenced by both I/O statements.

Conditions of Parallelism
Data dependence and resource dependence
5. Unknown dependence: The dependence relation between two statements
cannot be determined in the following situations:

• The subscript of a variable is itself subscripted (indirect addressing)

• The subscript does not contain the loop index variable

• A variable appears more than once with subscripts having different


coefficients of the loop variable

• The subscript is non linear in the loop index variable.
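
The sketch below (a hypothetical helper, not from the text) classifies the dependence between two register-level statements from their read/write sets, covering the flow (RAW), anti (WAR), and output (WAW) cases above:

def classify_dependences(writes1, reads1, writes2, reads2):
    """Return the dependences of statement 2 on statement 1 (executed earlier)."""
    deps = []
    if writes1 & reads2:
        deps.append("flow (RAW)")      # S2 reads what S1 wrote
    if reads1 & writes2:
        deps.append("anti (WAR)")      # S2 overwrites what S1 still needs to read
    if writes1 & writes2:
        deps.append("output (WAW)")    # both write the same location
    return deps or ["independent"]

# S1: ADD R1, R2, R3   -> writes {R1}, reads {R2, R3}
# S2: SUB R4, R1, R2   -> writes {R4}, reads {R1, R2}
print(classify_dependences({"R1"}, {"R2", "R3"}, {"R4"}, {"R1", "R2"}))  # ['flow (RAW)']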

Conditions of Parallelism
Data dependence and resource dependence
Control Dependence: the dependency is known only at run time (e.g. it arises from conditional
branches whose outcomes are not known until execution).

Control dependency prohibits parallelism.


Conditions of Parallelism
Data dependence and resource dependence
Resource Dependence:

– Data and control dependencies are based on the independence of the work
to be done.

– Resource independence is concerned with conflicts in using shared resources,


such as registers, integer and floating point ALUs, etc.

– ALU conflicts are called ALU dependence.

– Memory (storage) conflicts are called storage dependence.

Conditions of Parallelism
Data dependence and resource dependence
Bernstein’s Conditions - 1 : Bernstein’s conditions are a set of conditions which
must exist if two processes can execute in parallel.

– Notation

• Ii is the set of all input variables for a process Pi .

• Oi is the set of all output variables for a process Pi .

• If P1 and P2 can execute in parallel (which is written as P1 || P2), then the following
three conditions must hold (checked by the small sketch after Bernstein's Conditions - 2):

I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅
Conditions of Parallelism
Data dependence and resource dependence
Bernstein’s Conditions - 2 : In terms of data dependencies, Bernstein’s conditions
imply that two processes can execute in parallel if they are:

– Flow-independent, anti-independent, and output-independent.

– The parallelism relation || is commutative:
(Pi || Pj implies Pj || Pi),

– But not transitive:
(Pi || Pj and Pj || Pk does not imply Pi || Pk)

– Therefore, || is not an equivalence relation.
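
A minimal Python sketch of this pairwise test, using the input/output sets I and O defined above (the example statements are hypothetical):

def can_run_in_parallel(I1, O1, I2, O2):
    """Bernstein's conditions: P1 || P2 iff I1∩O2, I2∩O1 and O1∩O2 are all empty."""
    return not (I1 & O2) and not (I2 & O1) and not (O1 & O2)

# P1: C = A + B   -> I1 = {A, B}, O1 = {C}
# P2: D = A * E   -> I2 = {A, E}, O2 = {D}
# P3: A = D + 1   -> I3 = {D},    O3 = {A}
print(can_run_in_parallel({"A", "B"}, {"C"}, {"A", "E"}, {"D"}))  # True:  P1 || P2
print(can_run_in_parallel({"A", "E"}, {"D"}, {"D"}, {"A"}))       # False: P2 and P3 conflict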

Conditions of Parallelism
Hardware and Software Parallelism
Hardware Parallelism : Hardware parallelism is defined by machine architecture
and hardware multiplicity.
– It can be characterized by the number of instructions that can be issued per
machine cycle.
– If a processor issues k instructions per machine cycle, it is called a k-issue
processor.
– Conventional processors are one-issue machines.
– A machine with n k-issue processors should be able to handle a
maximum of nk threads of instructions simultaneously.
– Examples:
• Intel i960CA is a three-issue processor (arithmetic, memory access, branch)
• IBM RS-6000 is a four-issue processor (arithmetic, floating-point, memory access,
branch)

Conditions of Parallelism
Hardware and Software Parallelism
Software Parallelism :
– Software parallelism is defined by the control and data dependence of programs,
– Is revealed in the program’s flow graph.
– It is a function of algorithm, programming style, and compiler optimization.

Conditions of Parallelism
Hardware and Software Parallelism
Types of Software Parallelism :
1. Control Parallelism:
• Two or more operations can be performed simultaneously.
• This can be detected by a compiler, or a programmer can explicitly indicate control
parallelism
– By using special language constructs or dividing a program into multiple
processes.
2. Data parallelism:
• Multiple data elements have the same operations applied to them at the same
time.
• This offers the highest potential for concurrency (in SIMD and MIMD modes).
• Synchronization in SIMD machines handled by hardware.

Conditions of Parallelism
Hardware and Software Parallelism
The Role of Compilers :
– Compilers are used to exploit hardware features to improve performance.

– Interaction between compiler and architecture design is a necessity in modern


computer development.

– It is not necessarily the case that more software parallelism will improve performance
in conventional scalar processors.

– The hardware and compiler should be designed at the same time.

Conditions of Parallelism
Program Partitioning & Scheduling
Program Partitioning & Scheduling :

Mainly deals with:-


1. Grain Sizes and Latency:

2. Grain Packing and Scheduling

3. Grain determination and scheduling optimization

Conditions of Parallelism
Program Partitioning & Scheduling(Grain Sizes and Latency)

Grain Sizes and Latency:


– Size of a program for parallel execution can vary

– The sizes are roughly classified using the term “granule size,” or simply “granularity”

– The simplest measure, for example, is the number of instructions in a program part

– Grain sizes are usually described as fine, medium or coarse, depending on the level of
parallelism involved

Conditions of Parallelism
Program Partitioning & Scheduling(Grain Sizes and Latency)
Parallelism is exploited at various processing levels:

Conditions of Parallelism
Program Partitioning & Scheduling(Grain Sizes and Latency)
Instruction Level Parallelism:

– This fine-grained, or smallest granularity level typically involves less than 20


instructions per grain.

– The number of candidates for parallel execution varies from two to
thousands, with about five instructions or statements (on average) being
the typical degree of parallelism.

– Advantages:
• There are usually many candidates (Programs) for parallel execution

• Compilers can usually do a reasonable job of finding this parallelism

Conditions of Parallelism
Program Partitioning & Scheduling(Grain Sizes and Latency)
Loop Level Parallelism:

– Typical loop has less than 500 instructions.

– If a loop operation is independent between iterations, it can be handled by a


pipeline, or by a SIMD machine.

– Most loop operations are self scheduled to execute on a parallel or vector


machine (MIMD Mode)

– Some loops (e.g. recursive) are difficult to handle.

– Loop-level parallelism is still considered fine grain computation.

Conditions of Parallelism
Program Partitioning & Scheduling(Grain Sizes and Latency)
Procedure Level Parallelism:

– Medium-sized grain: usually contains less than 2000 instructions.

– Detection of parallelism is more difficult than with smaller grains

– Interprocedural dependence analysis is difficult and history-sensitive

– Communication requirement less than instruction-level

– SPMD (single procedure multiple data) is a special case

– Multitasking belongs to this level.

Conditions of Parallelism
Program Partitioning & Scheduling(Grain Sizes and Latency)

Subprogram Level Parallelism:

– Job step level and related subprograms

– Grain typically has thousands of instructions medium- or coarse-grain level

– Job steps can overlap across different jobs

– Multiprograming conducted at this level

– No compilers available to exploit medium- or coarse-grain parallelism at


present.

Conditions of Parallelism
Program Partitioning & Scheduling(Grain Sizes and Latency)

Job or Program-Level Parallelism:

– Corresponds to execution of essentially independent jobs or programs on a


parallel computer.

– This is practical for a machine with a small number of powerful processors,


but impractical for a machine with a large number of simple processors (since
each processor would take too long to process a single job).

Conditions of Parallelism
Program Partitioning & Scheduling(Grain Sizes and Latency)
Final Summary:
– Fine-grain parallelism is exploited at the instruction or loop level
• Assisted by the compiler.
– Medium-grain parallelism is exploited at the task or job-step level
• Requires programmer and compiler support.
– Coarse-grain parallelism is exploited at the program level
• Relies heavily on effective OS support.
– Shared-variable communication is used at the fine- and medium-grain levels.

– Message passing can be used for medium- and coarse-grain communication

• But fine grain really needs a better technique because of its heavier
communication requirements.
Conditions of Parallelism
Program Partitioning & Scheduling(Grain Sizes and Latency)
Communication Latency:

– Balancing granularity and latency can yield better performance.

– Various latencies are attributed to machine architecture, technology, and the
communication patterns used.

– Latency imposes a limiting factor on the scalability of the machine size.

– Example: memory latency increases as memory capacity increases,

• Limiting the amount of memory that can be used with a given tolerance for
communication latency.

Conditions of Parallelism
Program Partitioning & Scheduling(Grain Sizes and Latency)
Inter-process Communication Latency:

– Needs to be minimized by system designer

– Affected by signal delays and communication patterns

– Example:
• n communicating tasks may require

• n(n − 1)/2 communication links

• This complexity grows quadratically, effectively limiting the number of processors
in the system.

Conditions of Parallelism
Program Partitioning & Scheduling(Grain Packing and Scheduling )

Grain Packing and Scheduling:

Two questions exists in parallel programming:

1. How can I partition a program into parallel “pieces” to yield the shortest
execution time?

2. What is the optimal size of parallel grains?

There is an obvious tradeoff between the:

• Time spent for scheduling and synchronizing parallel grains and the speedup
obtained by parallel execution.

• One approach to the problem is called “grain packing.”


Conditions of Parallelism
Program Partitioning & Scheduling(Grain Packing and Scheduling )

Grain Packing and Scheduling:

• The grain size problem requires determination of both

• The number of partitions

• The size of grains in a parallel problem

• The solution is both problem-dependent and machine-dependent.

• A short schedule is required for fast execution of subdivided program


modules.

Conditions of Parallelism
Program Partitioning & Scheduling
(Grain Determination and Scheduling Optimization)

Grain determination and scheduling optimization:

Includes 4 steps:

Step 1: Construct a fine-grain program graph

Step 2: Schedule the fine-grain computation

Step 3: Grain packing to produce coarse grains

Step 4: Generate a parallel schedule based on the packed graph

Conditions of Parallelism
Program Partitioning & Scheduling(Program Graphs and Packing)
A program graph is like a dependence graph
• Nodes = {(n, s)}, where n = node name and s = grain
size
– The larger s is, the larger the grain size
– Edges = {(v, d)}, where v = communicated variable
and d = communication delay

– Packing 2 or more nodes produces a larger grain


size node and possibly more edges to other nodes.

– Packing helps in eliminating unnecessary
communication delays or reducing the overall scheduling
overhead.
Conditions of Parallelism
Program Partitioning & Scheduling(Program Graphs and Packing)
• Nodes 1, 2, 3, 4, 5, and 6 are memory reference
(data fetch) operations.
• Each takes one cycle to address and six cycles to
fetch from memory.

Conditions of Parallelism
Program Partitioning & Scheduling(Program Graphs and Packing)
– Each node in the program graph corresponds to a
computational unit in the program.

– The grain size is measured by the number of basic
machine cycles (including both processor and
memory cycles) needed to execute all the
operations within the node.

– Grain size reflects the number of computations


involved in a program segment.

– Fine-grain nodes have a smaller grain size, and
coarse-grain nodes have a larger grain size.
Program Flow Mechanisms

There are three flow mechanisms:

– Control-flow mechanism (used by conventional computers)

• The order of program execution is stated in the user program.

– Data-driven mechanism (used by dataflow computers)

• The order of program execution is driven by data (operand) availability.

– Demand-driven mechanism (used by reduction computers)

• The order of program execution is based on the demand for its result by other
computations.

Program Flow Mechanisms
Control flow mechanism
Control flow mechanism(used by conventional computers):
– Conventional von Neumann computers use a 

• Program counter to sequence the execution of instruction in a program.

• This sequential execution style is called control-driven.

– Control flow computers use shared memory to hold program instructions and data
objects.

– A uniprocessor computer is inherently sequential, since it uses the control-driven mechanism.

– Control flow can be made parallel by using parallel language constructs or a parallel
compiler.

– Control-flow machines give complete control, but are less efficient than other
approaches.
Program Flow Mechanisms
Data flow mechanism
Data-driven mechanism(used by data-flow computers):

– Dataflow computers emphasize a high degree of parallelism at the fine-grain


instructional level.

– Data flow machines have 

• High potential for parallelism and throughput and freedom from side effects,

• But have high control overhead, lose time waiting for unneeded arguments, and
difficulty in manipulating data structures.

– No need for shared memory, a program counter, or a control sequencer.

– Special mechanisms are required to detect data availability, match data tokens
with the instructions needing them, and enable the chain reaction of asynchronous
instruction execution.
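
As an illustration of data-driven execution, a minimal Python sketch (the tiny three-instruction program is hypothetical) in which an instruction fires as soon as all of its operands are available, rather than in program order:

# Each instruction: (destination, operation, source operands)
program = [
    ("c", lambda a, b: a + b, ("a", "b")),
    ("d", lambda c: c * 2,    ("c",)),
    ("e", lambda a: a - 1,    ("a",)),
]
tokens = {"a": 3, "b": 4}            # initially available data tokens

pending = list(program)
while pending:
    for instr in list(pending):
        dest, op, srcs = instr
        if all(s in tokens for s in srcs):      # fire when all operands have arrived
            tokens[dest] = op(*(tokens[s] for s in srcs))
            pending.remove(instr)
            print(f"fired {dest} = {tokens[dest]}")
print(tokens)   # {'a': 3, 'b': 4, 'c': 7, 'd': 14, 'e': 2}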
Program Flow Mechanisms
Demand flow mechanism
Demand-driven mechanism(used by reduction computers):

– Data-driven machines select instructions for execution based on the availability of


their operands  this is essentially a bottom-up approach.

• Eager Evaluation

– Demand-driven machines take a top-down approach, attempting to execute the


instruction (a demander) that yields the final result.

• This triggers the execution of instructions that yield its operands, and so forth.

• Lazy Evaluation

Program Flow Mechanisms
Demand flow mechanism
Demand-driven mechanism(used by reduction computers): 2 Reduction Models:

1. String-reduction model:

– Each demander gets a separate copy of the expression string to evaluate.

 Each reduction step has an operator and

 Embedded references to demand the corresponding input operands.

 Each operator is suspended while its arguments are being evaluated.

2. Graph-reduction model:

– Expression graph reduced by evaluation of branches or subgraphs, possibly in parallel,


with demanders given pointers to results of reductions.

– Based on sharing of pointers to arguments; traversal and reversal of pointers


continues until constant arguments are encountered.
System Interconnect Architectures

– Direct networks for static connections
– Indirect networks for dynamic connections
– Networks are used for 
  • Internal connections in a centralized system among processors, memory modules, and I/O disk arrays
  • Distributed networking of multicomputer nodes

System Interconnect Architectures

– Static and dynamic networks are used

• For interconnecting computer subsystems

• For constructing multiprocessors or multicomputers

– Ideal: Construct a low-latency network with a high data transfer rate

System Interconnect Architectures
Network Properties and Routing
– Static Networks:
• Point-to-Point direct connections which will not change during program
execution

– Dynamic Networks:
• Implemented with switched channels.

• They are dynamically configured to match the communication demand in user


programs

System Interconnect Architectures
Network Properties and Routing
Goals and Analysis:

– The goals of an interconnection network are to provide


• Low-latency

• High data transfer rate

• Wide communication bandwidth

– Analysis includes
• Latency

• Bisection bandwidth

• Data-routing functions

• Scalability of parallel architecture
System Interconnect Architectures
Network Properties and Routing
Terminologies used in Network Properties and Routing:

– Network usually represented by a graph :


• With a finite number of nodes

• Linked by directed or undirected edges.

– Number of nodes in graph = network size .

– Number of edges (links or channels) incident on a node  node degree d


(also note in and out degrees when edges are directed).
• Node degree reflects number of I/O ports associated with a node, and should
ideally be small and constant.
System Interconnect Architectures
Network Properties and Routing
Terminologies used in Network Properties and Routing:

– Diameter D of a network
• Is the maximum shortest path between any two nodes

• Measured by the number of links traversed

• This should be as small as possible (from a communication point of view)

– Channel bisection width b  minimum number of edges cut to split a


network into two parts each having the same number of nodes.
• Since each channel has w bit wires, the wire bisection width B = bw.
• Bisection width provides good indication of maximum communication
bandwidth along the bisection of a network,
• And all other cross sections should be bounded by the bisection width.
System Interconnect Architectures
Network Properties and Routing
Terminologies used in Network Properties and Routing:

– Wire (or channel) length  length (e.g. weight) of edges between nodes.

– Network is symmetric  if the topology is the same looking from any node;
these are easier to implement or to program.
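As a quick illustration of these terms (my own sketch, not from the text), the following Python code computes the node degree and the diameter of a 6-node bidirectional ring given as an adjacency list; the bisection width of a ring is simply noted, since finding a minimum bisection of an arbitrary network is a hard combinatorial problem.

# Node degree and network diameter of a small static network.
from collections import deque

ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}    # 6-node bidirectional ring

def degrees(adj):
    return {v: len(nbrs) for v, nbrs in adj.items()}

def diameter(adj):
    # Diameter = maximum over all node pairs of the shortest path length (in links).
    def farthest(src):
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return max(dist.values())
    return max(farthest(v) for v in adj)

print(degrees(ring))    # every node has degree 2
print(diameter(ring))   # 3, i.e. floor(6/2) for a bidirectional ring
# Bisection width: cutting the ring into two equal halves always cuts 2 links, so b = 2.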

System Interconnect Architectures
Network Properties and Routing
Data Routing Functions:Is used for inter-PE data exchange

– Shifting

– Permutation (one to one)

– Broadcast (one to all)

– Multicast (many to many)

– Personalized broadcast (one to many)

– Shuffle

– Exchange Etc.

System Interconnect Architectures
Network Properties and Routing
Data Routing Functions:Permutation (one to one)

– Given n objects, there are n! ways in which they can be reordered (one of which is no reordering).
– For example, in cycle notation: π = (a, b, c)(d, e)
  • The mappings are: a→b, b→c, c→a in a circular fashion, and d→e, e→d.
  • The cycle (a, b, c) has a period of 3 and (d, e) has a period of 2. Hence permutation π has a period of 3 × 2 = 6.
  • If every object is mapped to itself, we have the identity mapping I = (a)(b)(c)(d)(e).
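A small Python sketch (illustrative, not from the text): the period of a permutation given in cycle form is the least common multiple of its cycle lengths, which reproduces the period 6 for π = (a, b, c)(d, e).

from math import gcd
from functools import reduce

def period(cycles):
    # The permutation repeats after the lcm of its cycle lengths.
    return reduce(lambda x, y: x * y // gcd(x, y), (len(c) for c in cycles), 1)

print(period([("a", "b", "c"), ("d", "e")]))   # 6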


System Interconnect Architectures
Network Properties and Routing
Data Routing Functions:Permutation (one to one)

– A permutation can be specified by giving the rule for reordering a group of


objects.

– Permutations can be implemented using crossbar switches, multistage


networks, shifting, and broadcast operations.

– The time required to perform permutations of the connections between


nodes often dominates the network performance when n is large.

System Interconnect Architectures
Network Properties and Routing
Data Routing Functions:Perfect Shuffle and Exchange

– Perfect Shuffle is a special permutation function by Harold Stone

– It maps the entries according to the mapping of the k-bit binary address a b … k to b c … k a,
– That is, shifting the address 1 bit to the left and wrapping the most significant bit around to the least significant bit position.

– The inverse perfect shuffle reverses the effect of the perfect shuffle.
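A short Python sketch (my illustration) of the perfect shuffle and its inverse on k-bit node addresses; for k = 3 the address 101 shuffles to 011, as in the example used later for the Omega network.

# Perfect shuffle = cyclic left shift of a k-bit address by one bit position.
def shuffle(x, k):
    msb = (x >> (k - 1)) & 1
    return ((x << 1) & ((1 << k) - 1)) | msb

def inverse_shuffle(x, k):
    return (x >> 1) | ((x & 1) << (k - 1))

k = 3
for x in range(2 ** k):
    print(f"{x:03b} -> {shuffle(x, k):03b}")                        # 101 -> 011, etc.
assert all(inverse_shuffle(shuffle(x, k), k) == x for x in range(2 ** k))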

System Interconnect Architectures
Network Properties and Routing
Data Routing Functions:Perfect Shuffle and Exchange

– Perfect Shuffle is a special permutation function by Harold Stone

Perfect Shuffle
(Left Shift)

Inverse Perfect
Shuffle
(Right Shift)

System Interconnect Architectures
Network Properties and Routing
Data Routing Functions:Hypercube Routing Functions

A 3-cube with nodes C2C1C0
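The routing figures did not survive extraction; as a hedged illustration of the usual hypercube routing idea (the function below is my own sketch), a message in an n-cube is routed by complementing one differing address bit Ci per hop, so at most n hops are needed.

# Route src -> dst in an n-cube by fixing one differing address bit per hop.
def hypercube_route(src, dst, n):
    path, cur = [src], src
    for i in range(n):                    # dimensions C0, C1, ..., C(n-1)
        if (cur ^ dst) & (1 << i):        # bit i differs -> traverse that dimension
            cur ^= 1 << i
            path.append(cur)
    return path

# Example in a 3-cube: 000 -> 101 crosses dimensions C0 and C2.
print([f"{v:03b}" for v in hypercube_route(0b000, 0b101, 3)])   # ['000', '001', '101']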

System Interconnect Architectures
Network Properties and Routing
Data Routing Functions:Broadcast and Multicast

– Broadcast is one-to-all mapping


• This can be achieved in SIMD computers.

• Message passing multicomputers also have mechanisms to broadcast.

– Multicast is one-to-many mapping

• Is implemented by matching the destination nodes in the network.

System Interconnect Architectures
Network Properties and Routing
Data Routing Functions:Network Performance
– Functionality:-
• How the network supports data routing, interrupt handling, synchronization,
request/message combining, and coherence.
– Network latency:-
• Worst-case time for a unit message to be transferred.
– Bandwidth
• Maximum data rate
– Hardware complexity
• Implementation costs for wire, logic, switches, connectors, etc.
– Scalability
• How easily does the scheme adapt to an increasing number of processors,
memories, etc.?
System Interconnect Architectures
Network Properties and Routing
Static Connection Networks  Static networks use direct links which are fixed
once built.

Few topologies are:


– Linear Array

– Ring and Chordal Ring

– Barrel Shifter

– Tree and Star

– Fat Tree

– Mesh and Torus

System Interconnect Architectures
Network Properties and Routing
Static Connection Networks  Linear Array

– It is a one-dimensional network where 
  • n nodes are connected by n – 1 links in a line.
– Internal nodes have degree 2; end nodes have degree 1.
– Diameter = n – 1
– Bisection width = 1
– For small n this is economical, but for large n it is obviously inappropriate.
System Interconnect Architectures
Network Properties and Routing
Static Connection Networks  Ring, Chordal
Ring

– Like a linear array, but the two end nodes are connected by an nth link;
– The ring can be unidirectional or bidirectional.
– Diameter is n/2 for a bidirectional ring, or n for a unidirectional ring.

System Interconnect Architectures
Network Properties and Routing
Static Connection Networks  Ring, Chordal
Ring

– By adding additional links (e.g. “chords” in a circle),
– The node degree is increased, and we obtain a chordal ring.
– This reduces the network diameter.
– In the limit, we obtain a fully connected network, with a node degree of n – 1 and a diameter of 1.
System Interconnect Architectures
Network Properties and Routing
Static Connection Networks  Ring, Chordal
Ring

Another example of a chordal ring

System Interconnect Architectures
Network Properties and Routing
Static Connection Networks  Barrel Shifter

A barrel shifter is obtained from a ring  by adding extra links from each node to those nodes at a distance equal to an integer power of 2.

System Interconnect Architectures
Network Properties and Routing
Static Connection Networks  Tree and Star

– A k-level, completely balanced binary tree will have N = 2^k – 1 nodes,
– With a maximum node degree of 3 and a network diameter of 2(k – 1).

– The balanced binary tree is scalable, since it has a constant maximum node
degree.

System Interconnect Architectures
Network Properties and Routing
Static Connection Networks  Tree ,Star and Fat Tree
– A star is a two-level tree with a node degree d = N – 1 and a constant diameter
of 2.
– A fat tree is a tree in which the number of edges between nodes increases
closer to the root
– The edges represent communication channels (“wires”).

System Interconnect Architectures
Network Properties and Routing
Static Connection Networks  Mesh and Torus

– Pure mesh:
  • N = n^k nodes, with links between each adjacent pair of nodes in a row or column (or higher dimension).
  • Interior node degree d = 2k, diameter = k(n – 1).
  • This is not a symmetric network  the degree given above holds only for interior nodes; boundary and corner nodes have smaller degrees.

System Interconnect Architectures
Network Properties and Routing
Static Connection Networks  Mesh and Torus

– Illiac mesh (used in the Illiac IV computer):
  • Wraparound is allowed  thus reducing the network diameter to about half that of the equivalent pure mesh.
– A torus mesh:
  • Has ring connections in each dimension, and is symmetric.
  • An n × n torus has a node degree of 4 and a diameter of 2⌊n/2⌋.
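To make the trade-offs between these static topologies concrete, here is a small Python sketch (my own illustration) that evaluates the standard degree/diameter/bisection formulas for several networks of roughly 64 nodes.

# Degree, diameter and bisection width formulas for ~64-node static networks.
def props(topology, n=None, k=None):
    if topology == "linear":    return dict(N=n, degree=2, diameter=n - 1, bisection=1)
    if topology == "ring":      return dict(N=n, degree=2, diameter=n // 2, bisection=2)
    if topology == "mesh2d":    return dict(N=n * n, degree=4, diameter=2 * (n - 1), bisection=n)
    if topology == "torus2d":   return dict(N=n * n, degree=4, diameter=2 * (n // 2), bisection=2 * n)
    if topology == "hypercube": return dict(N=2 ** k, degree=k, diameter=k, bisection=2 ** (k - 1))

for t, kw in [("linear", dict(n=64)), ("ring", dict(n=64)),
              ("mesh2d", dict(n=8)), ("torus2d", dict(n=8)), ("hypercube", dict(k=6))]:
    print(t, props(t, **kw))
# Degrees quoted are the interior/maximum node degrees.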

System Interconnect Architectures
Network Properties and Routing
Static Connection Networks  Systolic Array

– A systolic array is an arrangement of processing elements and communication


links designed specifically to match the computation and communication
requirements of a specific algorithm (or class of algorithms).

– This specialized character may yield better performance than more


generalized structures, but also makes them more expensive, and more
difficult to program.

System Interconnect Architectures
Network Properties and Routing
Static Connection Networks  Hypercubes

– A binary n-cube architecture with N = 2^n nodes spanning along n dimensions, with two nodes per dimension.
– The hypercube scalability is poor, and packaging is difficult for higher-dimensional hypercubes.

System Interconnect Architectures
Network Properties and Routing
Static Connection Networks  Cube-connected Cycles

– A 3-cube-connected-cycles (CCC) network  can be created from a 3-cube by replacing each corner node (vertex) of the 3-cube with a ring (cycle) of 3 nodes.
– A k-cube-connected-cycles network  can be created from a k-cube by replacing each vertex of the k-dimensional hypercube with a ring of k nodes.
System Interconnect Architectures
Network Properties and Routing
Static Connection Networks  Network Throughput

– Network throughput:- number of messages a network can handle in a unit


time interval.

– One way to estimate is to calculate the maximum number of messages that


can be present in a network at any instant (its capacity)  Throughput usually
is some fraction of its capacity.

– A hot spot is a pair of nodes that accounts for a disproportionately large


portion of the total network traffic (possibly causing congestion).

– Hot spot throughput is the maximum rate at which messages can be sent between two specific nodes.
System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks 

– Dynamic connection networks can implement all communication patterns


based on program demands.

– In increasing order of cost and performance, these include


• Bus systems

• Multistage interconnection networks

• Crossbar switch networks

– Price can be attributed to the cost of wires, switches, arbiters, and connectors.

– Performance is indicated by network bandwidth, data transfer rate, network latency, and communication patterns supported.
System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks  Bus Systems

– A bus system (contention bus, time-sharing bus):


• Has a collection of wires and connectors
• Multiple modules (processors, memories, peripherals, etc.) which connect
to the wires
• Data transactions between pairs of modules
– Bus supports only one transaction at a time.

– Bus arbitration logic must deal with conflicting requests.

– Lowest cost and bandwidth of all dynamic schemes.

– Many bus standards are available.


System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks  Bus Systems
– A bus is the simplest type of dynamic interconnection networks.
– It constitutes a common data transfer path for many devices.
– Depending on the type of implemented transmissions we have serial
busses and parallel busses.
– The devices connected to a bus can be processors, memories, I/O units, as
shown in the figure below.

System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks  Bus Systems
– The throughput of the network based on a bus  can be increased by the use
of a multibus network shown in the figure below.
– In this network, processors connected to the busses can transmit data in
parallel (one for each bus) and many processors can read data from many
busses at a time.

System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks  Switch Modules

– An a × b switch module has a inputs and b outputs.

– A binary switch has a = b = 2.


• It is not necessary that a = b, but usually a = b = 2^k, for some integer k.

System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks  Switch Modules

– In general, any input can be connected to one or more of the outputs.

– However, multiple inputs may not be connected to the same output. When
only one-to-one mappings are allowed, the switch is called a crossbar switch.

System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks  Multistage Interconnection Networks

– In general, any multistage network is comprised of a 


• Collection of a × b switch modules and fixed network modules.

• The a × b switch modules are used to provide variable permutation or


other reordering of the inputs, which are then further reordered by the
fixed network modules.

– A generic multistage network consists of a sequence of alternating dynamic switches (with relatively small values for a and b) and static networks (with larger numbers of inputs and outputs). The static networks are used to implement interstage connections (ISC).
System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks  Multistage Interconnection Networks

Single-stage networks:
• Perfect shuffle operation: cyclic shift 1 place left, e.g. 101 --> 011

System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks  Multistage Interconnection Networks

– Single-stage networks:Since a single-stage network was not sufficient to get


the desired output, a multistage is built.

– Omega Network : An example of a multistage network is the omega network.

– The capability of single stage networks are limited 


• But if we cascade enough of them together, they form a completely connected
MIN (Multistage Interconnection Network).

– Switches can perform their own routing or can be controlled by a central


router
System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks  Multistage Interconnection Networks
This type of networks can be classified into the following four categories:
– Nonblocking : A network is called strictly nonblocking if it can connect any
idle input to any idle output regardless of what other connections are
currently in process

– Rearrangeable nonblocking : In this case a network should be able to


establish all possible connections between inputs and outputs by rearranging
its existing connections.

– Blocking interconnection: A network is said to be blocking if it can perform many, but not all, possible connections between terminals. Example: the Omega network.
System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks  Multistage Interconnection Networks
Omega Networks:
– A 2 × 2 switch can be configured for: straight-through, crossover, upper broadcast (upper input to both outputs), or lower broadcast (lower input to both outputs).
– With four stages of eight 2 × 2 switches, and a static perfect shuffle for each of the four ISCs, a 16 × 16 Omega network can be constructed (but not all permutations are possible).
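As a hedged illustration of how self-routing works in an Omega network (my own sketch, not taken from the text): before each stage the line number is perfect-shuffled, and the 2 × 2 switch at that stage then uses one destination bit (most significant first) to select its upper (0) or lower (1) output. This is commonly called destination-tag routing.

# Destination-tag routing through a 2^k-input Omega network.
def omega_route(src, dst, k):
    mask = (1 << k) - 1
    line, trace = src, [src]
    for stage in range(k):
        line = ((line << 1) & mask) | (line >> (k - 1))   # ISC: perfect shuffle
        bit = (dst >> (k - 1 - stage)) & 1                # destination bit for this stage
        line = (line & ~1) | bit                          # switch output: 0 = upper, 1 = lower
        trace.append(line)
    return trace

# Route input 2 (010) to output 5 (101) in an 8 x 8 Omega network (k = 3).
print([f"{v:03b}" for v in omega_route(0b010, 0b101, 3)])  # ends at '101'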

System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks  Multistage Interconnection Networks
Baseline Networks:
– A baseline network can be shown to be topologically equivalent to other
networks (including Omega), and has a simple recursive generation procedure.
– Stage k (k = 0, 1, …) is an m × m switch block (where m = N / 2^k) composed entirely of 2 × 2 switch blocks, each having two configurations: straight-through and crossover.
– The network can be generated recursively
– The first stage N × N, the second (N/2) × (N/2).
– Networks are topologically equivalent if one network can be easily reproduced
from the other networks by simply rearranging nodes at each stage

System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks  Multistage Interconnection Networks
Crossbar Networks:
– A crossbar switch has a number of input and output data pins and a number
of control pins.
– In response to control instructions sent to its control input, the crossbar switch implements a stable connection of a determined input with a determined output.
– A m × n crossbar network
• With m processors and n memories, one processor may be able to generate
requests for multiple memories in sequence; thus several switches might be set
in the same row.
• For m × m interprocessor communication, each PE is connected to both an input
and an output of the crossbar; only one switch in each row and column can be
turned on simultaneously. Additional control processors are used to manage the
crossbar itself.
System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks  Multistage Interconnection Networks
Crossbar Networks:
– Each junction is a switching component – connecting the row to the column.

Principles of Scalable Performance
Chapter-3
This chapter includes:

– Performance measures
– Speedup laws
– Scalability principles
– Scaling up vs. scaling down

Principles of Scalable Performance
Chapter-3
Performance metrics and measures:-

– Parallelism profiles

– Asymptotic speedup factor

– System efficiency, utilization and quality

– Standard performance measures

Principles of Scalable Performance
Parallelism Profile in Programs
Parallelism Profile in Programs:
– Degree of parallelism  Reflects the matching of software and
hardware parallelism

– Under this topic we study:


• Degree of parallelism

• Average parallelism

• Available parallelism

• Asymptotic speedup factor


Principles of Scalable Performance
Parallelism Profile in Programs

Degree of parallelism (DOP)  reflects the matching of software and hardware parallelism.
– DOP is defined as: the number of processors used, in each period of time, to execute the program.
  • It is a discrete time function that takes only non-negative integer values.
– The parallelism profile of a given program  is a plot of the DOP as a function of time, ideally assuming unlimited resources.

Principles of Scalable Performance
Parallelism Profile in Programs
Degree of parallelism(DOP): Factors affecting parallelism
profiles:

– Algorithm structure

– Program optimization

– Resource utilization

– Run-time conditions

– Realistically limited by  the number of available processors, memory, and other non-processor resources.
Principles of Scalable Performance
Parallelism Profile in Programs
Average parallelism:Its variables on which it depends-
– n  Homogeneous Processors

– m  Maximum Parallelism in a Profile


• In ideal case, n >> m

– ∆  Computing Capacity of a Single Processor


• Depends on  execution rate such as MIPS

• no overhead

– DOP = i  i processors busy during an observation period


Principles of Scalable Performance
Parallelism Profile in Programs
Average parallelism: Total amount of work W (instructions or
computations) performed is proportional to the area under the
profile curve

This integral is often


computed by the following
discrete summation

ti  Is the total amount of time


that DOP=i

Principles of Scalable Performance
Parallelism Profile in Programs
Average parallelism:Is computed by

In discrete form,
we get
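The equations on these two slides appeared only as figures; the standard forms (reconstructed from memory, so treat the notation as mine) are, in LaTeX:

W = \Delta \int_{t_1}^{t_2} \mathrm{DOP}(t)\, dt = \Delta \sum_{i=1}^{m} i \cdot t_i
\qquad
A = \frac{1}{t_2 - t_1} \int_{t_1}^{t_2} \mathrm{DOP}(t)\, dt = \frac{\sum_{i=1}^{m} i \cdot t_i}{\sum_{i=1}^{m} t_i}

where m is the maximum DOP observed and ti is the total time during which DOP = i.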

Principles of Scalable Performance
Parallelism Profile in Programs
Example: Parallelism profile and average parallelism

Principles of Scalable Performance
Parallelism Profile in Programs
Asymptotic speedup  let Wi be the amount of work executed with DOP = i.
– For i = 1 (a single processor, sequential execution), the corresponding response time is T(1).
– For i = ∞ (an unlimited number of processors), the corresponding response time is T(∞).

Principles of Scalable Performance
Parallelism Profile in Programs
Asymptotic speedup  S∞  is defined as the ratio of
T(1) to T(∞)

In ideal case,
n=∞
n >> m
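The formulas themselves were images in the original slides; the standard definitions (a reconstruction from memory, with Wi = iΔti the amount of work executed with DOP = i) are, in LaTeX:

T(1) = \sum_{i=1}^{m} \frac{W_i}{\Delta}
\qquad
T(\infty) = \sum_{i=1}^{m} \frac{W_i}{i\,\Delta}
\qquad
S_\infty = \frac{T(1)}{T(\infty)} = \frac{\sum_{i=1}^{m} i\,t_i}{\sum_{i=1}^{m} t_i} = A

so in the ideal case the asymptotic speedup equals the average parallelism A.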

Principles of Scalable Performance
Mean Performance
Performance measures:
– Consider n processors executing m programs in various modes
– Want to define the mean performance of these multimode
computers:
• Arithmetic mean performance
• Harmonic mean performance

Principles of Scalable Performance
Mean Performance
Arithmetic mean performance:
Let {Ri} be the execution rate of programs i=1,2,3,….,m.
Arithmetic mean execution rate is defined as:

Ex1: Average time to complete a program (10 secs, 20 secs, 15 secs)  15


secs •
Ex2:Consider the scores (1, 2, 2, 2, 3, 9). The arithmetic mean is 3.17, but five
out of six scores are below this.
Principles of Scalable Performance
Mean Performance
Arithmetic mean performance:
Let {Ri} be the execution rate of programs i =1,2,3,….,m
Note: If the programs are weighted with a distribution
π={fi | i=1,2,…..m}
Weighted Arithmetic mean execution rate is defined as:
By replacing 1/m with fi
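Both formulas appeared as images; in the standard form Ra = (1/m) Σ Ri and Ra* = Σ fi Ri (treat the exact notation as my reconstruction). A short Python sketch with made-up numbers:

# Arithmetic mean and weighted arithmetic mean execution rates.
rates   = [10.0, 40.0, 50.0]        # Ri (e.g. MIPS) for m = 3 execution modes (made-up)
weights = [0.5, 0.3, 0.2]           # fi, the weight (fraction of time) of each mode

Ra  = sum(rates) / len(rates)                       # (1/m) * sum(Ri)
Ras = sum(f * r for f, r in zip(weights, rates))    # sum(fi * Ri)
print(Ra, Ras)                                      # 33.33..., 27.0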

Principles of Scalable Performance
Mean Performance
Harmonic mean performance:
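The defining equations on this slide were figures; the standard forms (reconstructed from memory) are, in LaTeX:

R_h = \frac{m}{\sum_{i=1}^{m} (1/R_i)}
\qquad
R_h^{*} = \frac{1}{\sum_{i=1}^{m} f_i / R_i} \quad\text{(weighted, with weights } f_i)

The harmonic mean is dominated by the slowest rates, which makes it the appropriate mean when each mode performs the same amount of work.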

Principles of Scalable Performance
Mean Performance
Harmonic mean Speedup:
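The speedup equation was likewise a figure; in the standard formulation (a reconstruction, assuming the program runs in mode i, i.e. on i processors, for a fraction fi of the work, with Ri = i):

S = \frac{T_1}{T^{*}} = \frac{1}{\sum_{i=1}^{n} f_i / R_i} = \frac{1}{\sum_{i=1}^{n} f_i / i}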

Principles of Scalable Performance
Mean Performance-Amdahl’s Law
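The derivation on these slides was presented as figures; Amdahl's law itself (stated here from memory, with α the fraction of the workload that must run sequentially and the remaining 1 − α assumed perfectly parallelizable over n processors) is, in LaTeX:

S_n = \frac{T_1}{T_n} = \frac{1}{\alpha + (1-\alpha)/n} = \frac{n}{1 + (n-1)\,\alpha}
\;\longrightarrow\; \frac{1}{\alpha} \quad\text{as } n \to \infty

so the sequential fraction α bounds the achievable speedup no matter how many processors are added.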

Principles of Scalable Performance
Efficiency, Utilization and Quality
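The definitions on these slides were images; the measures normally used here (a reconstruction from memory, with O(n) the total number of unit operations performed and T(n) the execution time on n processors, assuming T(1) = O(1)) are, in LaTeX:

S(n) = \frac{T(1)}{T(n)} \qquad
E(n) = \frac{S(n)}{n} \qquad
R(n) = \frac{O(n)}{O(1)} \qquad
U(n) = R(n)\,E(n) = \frac{O(n)}{n\,T(n)} \qquad
Q(n) = \frac{S(n)\,E(n)}{R(n)}

Efficiency E measures how well the processors are used, redundancy R the extra work introduced by parallelization, utilization U the fraction of busy processor cycles, and quality Q combines the three into a single figure of merit.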

Principles of Scalable Performance
Efficiency, Utilization and Quality

Mflops  mega floating-point operations per second
KLIPS  kilo logic inferences per second; used to indicate the reasoning power of an AI machine
Principles of Scalable Performance
Parallel Processing Applications
Few applications listed:-

– Drug design
– High-speed civil transport
– Ocean modelling
– Ozone depletion research
– Air pollution
– Digital anatomy

Parallel Processing Applications

Massively parallel processing has become one of the challenges in


supercomputer applications.

Massive Parallelism for Grand Challenges (the list of grand challenges varies with time):
– Any machine having hundreds or thousands of processors is a Massively Parallel Processing (MPP) system.
– Here we will discuss the grand challenges identified in the U.S. High Performance Computing and Communications (HPCC) program.

Parallel Processing Applications

Exploiting Massive Parallelism:
– Instruction level parallelism:
  • Very limited
  • Very few processors exist which can execute 2 instructions in one cycle.
  • Depends upon the compiler, the OS capabilities, and the program flow mechanisms built into modern computers.
– Data parallelism:
  • Higher than instruction level parallelism.
  • Deals with large arrays/operands.
– On message-passing multicomputers  parallelism is more scalable than on shared-memory multiprocessors.
Parallel Processing Applications ( Application Models)
Application Models for Parallel Processing:

– Keeping the workload as constant (α)  Efficiency E decreases rapidly as


Machine Size n increases

– Because: Overhead h increases faster than machine size


Parallel Processing Applications ( Application Models)
Application Models for Parallel Processing:
– To maintain efficiency at a desired level  we have to increase the machine size and the problem size proportionally  known as scalable computers / scaled problems.
– A workload growing as a linear function of n (represented as γ)  linear scalability in problem size.
– If linear growth is not possible, we can settle for sublinear workload growth (β), somewhere between constant and linear.
– If there is enormous growth in the workload (θ)  the system is poorly scalable, because keeping efficiency constant would exceed the memory and I/O limits.

Parallel Processing Applications ( Application Models)
Application Models for Parallel Processing: 3 Models  Depends on limited
memory, limited latency tolerance , limited I/O bandwidth
The models are:
1. Fixed load model:
• Corresponds to constant workload (α)

• The use is limited by the communication bound.


2. Fixed time model:
• Demands constant execution time

• Linear workload growth refers to this model (γ)


3. Fixed memory model:
• Is limited by memory bound

• Corresponds to workload curve between γ and θ

Parallel Processing Applications
( Scalability of Parallel Algorithms)
Algorithmic characteristics:-

– Parallel algorithms are specially designed for parallel computers

– Important characteristics of parallel algorithms which are machine


implementable are:
• Deterministic versus Non-Deterministic

• Computational granularity

• Parallelism Profile

• Communicational patterns and synchronization requirements

• Uniformity of the operations

• Memory requirement and data structures.


Parallel Processing Applications
( Scalability of Parallel Algorithms)
Isoefficiency  it is the concept of:
– Relating the workload w = w(s) (where s is the problem size) to the machine size n needed to maintain a fixed efficiency  for a parallel algorithm on a parallel computer.
– Let h be the communication overhead of the algorithm implementation  h = h(s, n).
– Efficiency is given by the ratio of useful workload to total work (see the reconstruction below).
– With a fixed problem size  efficiency decreases as n increases.
– With a fixed machine size  efficiency increases as s increases with n fixed.
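The efficiency formula on this slide was an image; the standard isoefficiency relation (a reconstruction from memory, with w(s) the useful workload and h(s, n) the total communication and synchronization overhead) is, in LaTeX:

E = \frac{w(s)}{w(s) + h(s,n)}
\qquad\Longrightarrow\qquad
w(s) = C \cdot h(s,n), \quad C = \frac{E}{1 - E}

To keep E fixed as n grows, the workload w(s) must grow at least as fast as the overhead h(s, n); the smallest such function w(s) is the isoefficiency function of the algorithm-machine combination.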


Speedup Performance Laws

G(n)Reflects the increase in workload has memory increases n times.

Scalability Analysis and Approaches

Scalability Metrics and Goals:

Scalability Analysis and Approaches

Attained by varying only the number of processors (n).
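The metric this chapter uses was shown only as a figure; as a hedged reconstruction from memory, the scalability of a machine for a given algorithm compares the speedup achieved on the real machine with the speedup an idealized (PRAM-like) realization would achieve for the same problem size s and machine size n:

\Phi(s, n) = \frac{S(s, n)}{S_I(s, n)}, \qquad 0 < \Phi(s, n) \le 1

The closer Φ(s, n) is to 1, the better the architecture scales for that algorithm.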

Scalability Analysis and Approaches
Scalability Issues and Possible solutions:
