17CS72
MODULE-1
Theory of Parallelism
Module contents:
– Conditions of Parallelism
– Program Partitioning and Scheduling
– Program Flow Mechanisms, System Interconnect Architectures
– Principles of Scalable Performance
– Performance Metrics and Measures, Parallel Processing Applications
– Speedup Performance Laws, Scalability Analysis and Approaches
Theory of Parallelism
THE STATE OF COMPUTING: Computer Development Milestones
• How it all started…
500 BC: Abacus (China) – the earliest mechanical computing/calculating device,
operated to perform decimal arithmetic with carry propagation digit by digit.
1642: Mechanical Adder/Subtractor (Blaise Pascal)
1827: Difference Engine (Charles Babbage)
1941: First binary mechanical computer (Konrad Zuse; Germany)
1944: Harvard Mark I (IBM)
The very first electromechanical decimal computer as proposed by Howard Aiken
• Computer Generations
1st, 2nd, 3rd, 4th, 5th
Division into generations marked primarily by changes in
hardware and software technologies
THE STATE OF COMPUTING: Computer Generations
THE STATE OF COMPUTING: Elements of Modern Computers
Computing problems
THE STATE OF COMPUTING: Elements of Modern Computers
Algorithms and Data Structures
THE STATE OF COMPUTING: Elements of Modern Computers
Hardware Resources
THE STATE OF COMPUTING: Elements of Modern Computers
Operating System
• Operating systems manage the allocation and deallocation of resources during user
program execution.
• An OS plays a significant role in mapping hardware resources to algorithmic and data
structures.
– Implementation of the mapping:
• Relies on efficient compiler and operating-system support.
• Parallelism can be exploited at:
– Algorithm design time
– Program time
– Compile time
– Run time
THE STATE OF COMPUTING: Elements of Modern Computers
System Software Support
THE STATE OF COMPUTING: Elements of Modern Computers
Compiler Support
THE STATE OF COMPUTING: Evolution of Computer Architecture
Look-ahead, Parallelism and Pipelining
• Look-ahead
– Introduced to prefetch instructions so that instruction fetch overlaps with execution
– Enables functional parallelism
• Parallelism
– Supported by two approaches:
• Use of multiple functional units simultaneously
• Practicing pipelining at various processing levels, which includes:
– Pipelined instruction execution
– Pipelined arithmetic computations
– Memory-access operations
• Pipelining
– Performs identical operations repeatedly over vector data strings
– Was initially carried out using scalar pipeline processors
THE STATE OF COMPUTING: Evolution of Computer Architecture
Flynn’s classification
• This taxonomy distinguishes multiprocessor computer architectures along two
independent dimensions:
• Instruction stream
– The sequence of instructions executed by the machine
• Data stream
– The sequence of data (input, partial, or temporary results) used by the instruction stream
• Each of these dimensions can have only one of two possible states:
– Single or Multiple
• Flynn’s classification depends on the distinction between:
– The behaviour of the control unit and of the data-processing unit,
– rather than their operational and structural interconnections.
• The four categories of Flynn’s classification are:
– SISD (single instruction stream, single data stream)
– SIMD (single instruction stream, multiple data streams)
– MIMD (multiple instruction streams, multiple data streams)
– MISD (multiple instruction streams, single data stream)
THE STATE OF COMPUTING: Evolution of Computer Architecture
Parallel/Vector Computers
Parallel computers are those that execute programs in MIMD mode.
• There are two major classes of parallel computers:
– Shared-memory multiprocessors
– Message-passing multicomputers
• The major distinction between multiprocessors and multicomputers lies in:
– Memory sharing
• Multiprocessors: processors communicate through shared variables in a common memory.
• Multicomputers: each computer (node) has a local memory, unshared with the others.
– Inter-processor communication
• Carried out through message passing in a multicomputer.
Vector Processors:
• Equipped with multiple vector pipelines that can be concurrently used under
hardware or firmware control
• Two families :
– Memory-to-memory architecture
– Register-to-register architecture
THE STATE OF COMPUTING: Evolution of Computer Architecture
Development Layers
Programmer’s view: the two upper layers (applications and algorithms/data structures)
should be architecture-transparent.
As a good computer architect, one has to approach the problem from both ends.
THE STATE OF COMPUTING: System Attributes Affecting Performance
Performance Indicators:
• The simplest measure of program performance is the turnaround time, which includes:
– Disk and memory accesses
– Input and output
– Compilation time
– Operating system overhead
– CPU time
• Since I/O and system overhead frequently overlap with the processing of other programs,
it is fair to consider only:
– The CPU time used by a program
– Of this, the user CPU time is the most important factor.
THE STATE OF COMPUTING: System Attributes Affecting Performance
System Attributes:
Total CPU time can be used as a basis in estimating the execution rate of a
processor.
THE STATE OF COMPUTING: System Attributes Affecting Performance
MIPS Rate:
If C is the total number of clock cycles needed to execute a given program, then:
• The total CPU time can be estimated as:
– T = C × τ = C / f
• Other relationships follow directly:
– CPI = C / Ic
– T = Ic × CPI × τ
– T = Ic × CPI / f
Processor speed is often measured in millions of instructions per second, frequently
called the MIPS rate of the processor.
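A small numeric sketch (the instruction count, CPI and clock rate below are made-up
values) tying these relations together; the MIPS expression used is the standard
definition MIPS = Ic / (T × 10^6) = f / (CPI × 10^6).

```python
# Hypothetical values chosen only to exercise the formulas above.
Ic  = 200_000_000       # instruction count
CPI = 1.5               # average clock cycles per instruction
f   = 50e6              # clock rate in Hz (tau = 1/f)

C = Ic * CPI            # total clock cycles
T = C / f               # total CPU time in seconds:  T = Ic * CPI / f
mips = Ic / (T * 1e6)   # MIPS rate, equivalently f / (CPI * 1e6)

print(f"T = {T:.2f} s, MIPS = {mips:.1f}")   # T = 6.00 s, MIPS = 33.3
```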
THE STATE OF COMPUTING: System Attributes Affecting Performance
MIPS Rate:
Note:
All four system attributes (instruction set, compiler, processor, and memory
technologies) affect the MIPS rate, which varies also from program to program.
THE STATE OF COMPUTING: System Attributes Affecting Performance
Throughput Rate:
System throughput Ws is the number of programs a system can execute per unit time,
measured in programs per second.
CPU throughput Wp is defined as:
Wp = f / (Ic × CPI)
In a multiprogrammed system, the system throughput is often lower than the CPU
throughput (Ws ≤ Wp) because of system overheads.
THE STATE OF COMPUTING: System Attributes Affecting Performance
Programming Environments:
• Programmability depends on the :
– Programming environment provided to the users.
• Conventional computers are used in a sequential programming
environment
– With tools developed for a uniprocessor computer.
• Parallel computers need parallel tools:
– That allow specification or easy detection of parallelism
– Operating systems that can perform parallel scheduling of:
• Concurrent events
• Shared memory allocation
• Shared peripheral
• Communication links.
THE STATE OF COMPUTING: System Attributes Affecting Performance
Approach of Parallelism
Multiprocessors and Multicomputers
Categories of Parallel Computers:
• Considering their architecture only, there are two main categories of
parallel computers:
– Systems with shared common memories(Shared Memory Multiprocessors)
• The UMA Model
• The NUMA Model
• The COMA Model
• The CC-NUMA Model
– Systems with unshared distributed memories(Distributed-Memory
Multicomputers)
• The NORMA Machines
• Message Passing multicomputers
– Taxonomy of MIMD Computers
– Representative Systems
• Multiprocessors: BBN TC-2000, MPP, S-81, IBM ES/9000 Model 900/VF
• Multicomputers: Intel Paragon XP/S, nCUBE/2 6480, SuperNode 1000, CM-5, KSR-1
Multiprocessors and Multicomputers:
Shared Memory Multiprocessors (Uniform Memory Access Model - UMA)
• Physical memory uniformly shared by
all processors
– Has equal access time to all words.
• Processors may have local cache
memories.
• Peripherals also shared in some
fashion.
• Tightly coupled systems use a :
– common bus, crossbar, or multistage
network to connect processors,
peripherals, and memories.
• Many manufacturers have
multiprocessor (MP) extensions of
uniprocessor (UP) product lines.
Multiprocessors and Multicomputers:
Shared Memory Multiprocessors(UMA Model)
• Symmetric MP systems :
– All processors have access to all peripherals
– Any processor can run the OS and I/O device
drivers.
• Asymmetric MP systems :
– Not all peripherals accessible by all processors
– Kernel runs only on selected processors (master)
– Others are called attached processors (APs).
Multiprocessors and Multicomputers:
Shared Memory Multiprocessors (Cache-Only Memory Access Model - COMA)
Multiprocessors and Multicomputers:
Shared Memory Multiprocessors
(Cache-Coherent Non-Uniform Memory Access - CC-NUMA)
Multiprocessors and Multicomputers:
Distributed Memory Multicomputers
• A multicomputer consists of multiple computers (nodes) interconnected by a
message-passing network. Each node has its own local memory, unshared with the other
nodes.
Taxonomy of MIMD Computers
(Taxonomy figure: MIMD machines are grouped into scalable and non-scalable designs.)
Multivector and SIMD Computers
Multivector and SIMD Computers
Vector Supercomputers: a vector computer is built on top of a scalar processor.
Multivector and SIMD Computers
Vector Supercomputers
Step 1-2: Program and data are first loaded into the main memory through a host computer.
Step 3: All instructions are first decoded by the scalar control unit.
Step 4: If an instruction is decoded as a scalar or program-control operation, it is
executed directly by the scalar processor using the scalar functional pipelines.
Step 5: If an instruction is decoded as a vector operation, it is sent to the vector
control unit.
Step 6: The vector control unit supervises the flow of vector data between the main
memory and the vector functional pipelines.
Note: A number of vector functional pipelines may be built into a vector processor.
Multivector and SIMD Computers
Vector Supercomputers - 2 Models
Model 1: Register-to-register architecture
• Vector registers are used to hold all vector operands, intermediate results, and final
vector results.
Model 2: Memory-to-memory architecture
• Vector operands and results are streamed to and from the main memory in large units
called superwords (e.g. 512 bits).
Multivector and SIMD Computers
SIMD Supercomputers
A single instruction stream operates over multiple data streams.
Multivector and SIMD Computers
SIMD Supercomputers
An operational model of an SIMD computer is specified by a 5-tuple: (N, C, I, M, R)
(1) N = the number of processing elements (PEs) in the machine.
(2) C = the set of instructions directly executed by the control unit (CU), including
scalar and program-flow control instructions.
(3) I = the set of instructions broadcast by the CU to all PEs for parallel execution.
• These include arithmetic, logic, data-routing, masking, and other local operations
executed by each active PE over the data within that PE.
(4) M = the set of masking schemes, each of which partitions the PEs into enabled and
disabled subsets.
(5) R = the set of data-routing functions specifying the interconnection patterns used
for inter-PE communication.
PRAM and VLSI Models
Parallel Random Access Machines :
o NP-completeness:
» An algorithm has polynomial complexity if there exists a polynomial p(s) such that its
time complexity is O(p(s)) for problem size s.
» P ⊂ NP
PRAM and VLSI Models
PRAM Models
Conventional uniprocessor computers
• Are modelled as random-access machines (RAM).
Parallel random-access machines (PRAMs)
• Are modelled to idealize parallel computers with zero synchronization or memory-access
overhead.
• Are used for parallel algorithm development and for scalability and complexity analysis.
• Have a globally addressable shared memory, either distributed among the n processing
elements (PEs) or centralized.
• The PEs operate on a synchronized read-memory, compute, write-memory cycle.
Four memory-update options in PRAMs:
• Exclusive read (ER): at most one processor can read from a given memory location in
each cycle.
• Exclusive write (EW): at most one processor can write to a given memory location in
each cycle.
• Concurrent read (CR): multiple processors may read the same memory location in the
same cycle.
• Concurrent write (CW): multiple processors may write to the same memory location
simultaneously; the conflicts must be resolved by some policy.
PRAM and VLSI Models
PRAM Variants:- Classified based on Memory read/write : Four memory-update
1. EREW-PRAM model:
Prohibits more than one processor from reading or writing the same memory cell
simultaneously.
2. CREW-PRAM model:
Allows concurrent reads of the same memory location; write conflicts are avoided by
mutual exclusion.
3. ERCW-PRAM model:
Allows exclusive reads but concurrent writes to the same memory location.
4. CRCW-PRAM model:
Allows both concurrent reads and concurrent writes to the same memory location.
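A minimal sketch (not from the slides) that makes the four variants concrete: it checks
whether one synchronous PRAM step is legal under a given variant, given the memory cell
each PE reads or writes in that cycle.

```python
def step_is_legal(reads, writes, variant):
    """reads/writes: dict mapping PE id -> memory cell accessed this cycle (or None).
    variant: one of 'EREW', 'CREW', 'ERCW', 'CRCW'."""
    concurrent_read_ok = variant in ('CREW', 'CRCW')
    concurrent_write_ok = variant in ('ERCW', 'CRCW')

    def has_conflict(accesses):
        cells = [c for c in accesses.values() if c is not None]
        return len(cells) != len(set(cells))      # some cell is touched twice

    if not concurrent_read_ok and has_conflict(reads):
        return False
    if not concurrent_write_ok and has_conflict(writes):
        return False
    return True

# PEs 0 and 1 both read cell 5: legal only when concurrent reads are allowed.
print(step_is_legal({0: 5, 1: 5}, {0: 9, 1: 10}, 'EREW'))  # False
print(step_is_legal({0: 5, 1: 5}, {0: 9, 1: 10}, 'CREW'))  # True
```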
PRAM and VLSI Models
Discrepancy with Physical Models of PRAM Variants:-
PRAM and VLSI Models
VLSI Complexity Model:-
Parallel computers rely on VLSI chips to fabricate components such as processor arrays,
memory arrays, and switching networks.
Today’s VLSI technologies are essentially 2-dimensional, and the size of a VLSI chip is
proportional to the amount of storage (memory) space available on that chip.
We therefore work with the AT² model for 2-dimensional VLSI:
• The chip area A of the VLSI implementation of an algorithm measures its space
complexity.
• If T is the time (latency) needed to execute the algorithm, then A·T gives an upper
bound on the total number of bits processed through the chip (I/O).
• For certain computations there exists a lower bound f(s) such that
A·T² ≥ O(f(s))
where A is the chip area and T the execution time.
PRAM and VLSI Models
VLSI Complexity Model:-
Memory Bound on Chip Area:
Many computations are memory-bound, as they need to process large data sets.
The memory requirement of a computation sets a lower bound on the chip area A.
The amount of information processed by the chip can be visualized as information flow
upward across the chip area.
PRAM and VLSI Models
VLSI Complexity Model:-
Bisection Communication Bound:
The bisection is represented by a vertical slice cutting across the chip area A.
Chapter-2
Program and Network Properties
Conditions of Parallelism
Three key areas require significant progress for parallel processing to advance:
• Computational models for parallel computing
• Inter-processor communication in parallel architectures
• System integration for incorporating parallel systems into general computing environments
Conditions of Parallelism
Data dependence and resource dependence
The ability to execute several program segments in parallel:
• Requires each segment to be independent of the other segments.
• We use a dependence graph to describe the relations.
• The nodes of a dependence graph correspond to the program statements
(instructions).
• Directed edges with different labels are used to represent the ordered
relations among the statements.
• The analysis of dependence graphs shows where opportunity exists for
parallelization.
Conditions of Parallelism
Data dependence and resource dependence
Data Dependence : The ordering relationship between statements is
indicated by the data dependence.
Five Types:
1. Flow dependence
2. Antidependence
3. Output dependence
4. I/O dependence
5. Unknown dependence
Conditions of Parallelism
Data dependence and resource dependence
1. Flow dependence: A statement S2 is flow-dependent on S1 if S2 uses a value produced
by S1; also called a RAW (read-after-write) hazard.
2. Antidependence: S2 is antidependent on S1 if S2 writes a variable that S1 reads; also
called a WAR (write-after-read) hazard.
Conditions of Parallelism
Data dependence and resource dependence
3. Output dependence: Two statements are output-dependent if they produce (write) the
same output variable; also called a WAW (write-after-write) hazard. (A short example
covering RAW, WAR and WAW follows below.)
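A tiny hypothetical straight-line code fragment (not from the slides) showing the three
hazard types just defined.

```python
# Illustration only: four statements over made-up variables.
b, c = 1, 2
a = b + c       # S1: writes a, reads b and c
d = a * 2       # S2: reads a written by S1        -> flow dependence (RAW): S1 -> S2
b = 7           # S3: writes b, which S1 read      -> antidependence (WAR) between S1 and S3
d = b - 1       # S4: writes d, also written by S2 -> output dependence (WAW) between S2 and S4
print(a, b, d)  # 3 7 6
```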
Conditions of Parallelism
Data dependence and resource dependence
5. Unknown dependence: The dependence relation between two statements cannot be
determined at compile time, for example when indirect addressing or subscripted
subscripts are involved.
Conditions of Parallelism
Data dependence and resource dependence
Control dependence: the order of execution of statements cannot be determined before
run time (it depends, for example, on conditional branches).
– Data and control dependences are based on the independence of the work to be done.
Conditions of Parallelism
Data dependence and resource dependence
Bernstein’s Conditions - 1: Bernstein’s conditions are a set of conditions that must
hold if two processes are to execute in parallel.
– Notation: Ii is the input set (variables read) and Oi is the output set (variables
written) of process Pi.
Conditions of Parallelism
Data dependence and resource dependence
Bernstein’s Conditions - 2: In terms of data dependences, Bernstein’s conditions state
that two processes P1 and P2 can execute in parallel (P1 || P2) if they are
flow-independent, anti-independent, and output-independent:
I1 ∩ O2 = ∅,  I2 ∩ O1 = ∅,  and  O1 ∩ O2 = ∅
The relation || is commutative (Pi || Pj implies Pj || Pi) but not transitive.
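A minimal sketch (an assumed helper, not from the slides) that checks the three
conditions for two processes given their read and write sets.

```python
def can_run_in_parallel(I1, O1, I2, O2):
    """True if P1 || P2 under Bernstein's conditions."""
    return (not (I1 & O2)       # flow independence:   I1 ∩ O2 = ∅
            and not (I2 & O1)   # anti independence:   I2 ∩ O1 = ∅
            and not (O1 & O2))  # output independence: O1 ∩ O2 = ∅

# P1: c = a + b    P2: d = a * 2    -> no conflicting accesses, parallel execution is safe.
print(can_run_in_parallel({'a', 'b'}, {'c'}, {'a'}, {'d'}))   # True
# P3: a = a + 1 writes 'a', which P1 reads -> not parallelizable with P1.
print(can_run_in_parallel({'a', 'b'}, {'c'}, {'a'}, {'a'}))   # False
```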
Conditions of Parallelism
Hardware and Software Parallelism
Hardware Parallelism : Hardware parallelism is defined by machine architecture
and hardware multiplicity.
– It can be characterized by the number of instructions that can be issued per
machine cycle.
– If a processor issues k instructions per machine cycle, it is called a k-issue
processor.
– Conventional processors are one-issue machines.
– A machine with n processors with k-issue should be able to handle a
maximum of nk threads simultaneously.
– Examples:
• Intel i960CA is a three-issue processor (arithmetic, memory access, branch)
• IBM RS-6000 is a four-issue processor (arithmetic, floating-point, memory access,
branch)
Conditions of Parallelism
Hardware and Software Parallelism
Software Parallelism :
– Software parallelism is defined by the control and data dependences of programs and is
revealed in the program’s flow graph.
– It is a function of the algorithm, the programming style, and the compiler optimization.
Conditions of Parallelism
Hardware and Software Parallelism
Types of Software Parallelism :
1. Control Parallelism:
• Two or more operations can be performed simultaneously.
• This can be detected by a compiler, or a programmer can explicitly indicate control
parallelism
– By using special language constructs or dividing a program into multiple
processes.
2. Data parallelism:
• Multiple data elements have the same operations applied to them at the same
time.
• This offers the highest potential for concurrency (in SIMD and MIMD modes).
• Synchronization in SIMD machines handled by hardware.
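A tiny sketch (not from the slides) of data parallelism in this sense: the same
operation applied to many data elements at once, here dispatched to several worker
processes.

```python
from multiprocessing import Pool

def scale(x):              # the identical operation applied to every element
    return 2 * x

if __name__ == "__main__":
    data = list(range(16))
    with Pool(processes=4) as pool:       # 4 workers operate on the data concurrently
        print(pool.map(scale, data))      # [0, 2, 4, ..., 30]
```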
Conditions of Parallelism
Hardware and Software Parallelism
The Role of Compilers :
– Compilers are used to exploit hardware features to improve performance.
– It is not necessarily the case that more software parallelism will improve performance
in conventional scalar processors.
Conditions of Parallelism
Program Partitioning & Scheduling
Program Partitioning & Scheduling :
Conditions of Parallelism
Program Partitioning & Scheduling(Grain Sizes and Latency)
– The sizes are roughly classified using the term “granule size,” or simply “granularity”
– The simplest measure, for example, is the number of instructions in a program part
– Grain sizes are usually described as fine, medium or coarse, depending on the level of
parallelism involved
Conditions of Parallelism
Program Partitioning & Scheduling(Grain Sizes and Latency)
Parallelism is exploited at various processing levels:
Conditions of Parallelism
Program Partitioning & Scheduling(Grain Sizes and Latency)
Instruction Level Parallelism:
– Advantages:
• There are usually many candidate instructions for parallel execution
Conditions of Parallelism
Program Partitioning & Scheduling(Grain Sizes and Latency)
Loop Level Parallelism:
Conditions of Parallelism
Program Partitioning & Scheduling(Grain Sizes and Latency)
Procedure Level Parallelism:
Conditions of Parallelism
Program Partitioning & Scheduling(Grain Sizes and Latency)
Final Summary:
– Fine-grain parallelism is exploited at the instruction or loop level,
• often with the assistance of the compiler.
– Medium-grain parallelism is exploited at the task or job-step level,
• and requires programmer and compiler support.
– Coarse-grain parallelism is exploited at the program (job) level,
• and relies heavily on effective OS support.
– Shared-variable communication is used at the fine- and medium-grain levels.
Conditions of Parallelism
Program Partitioning & Scheduling(Grain Sizes and Latency)
Inter-process Communication Latency:
– Example:
• n communicating tasks may require n(n − 1)/2 communication links among them, so the
communication demand grows quadratically with n.
Conditions of Parallelism
Program Partitioning & Scheduling(Grain Packing and Scheduling )
1. How can we partition a program into parallel “pieces” (grains) to yield the shortest
execution time?
2. What is the optimal size of the parallel grains?
• There is a trade-off between the time spent scheduling and synchronizing parallel
grains and the speedup obtained by their parallel execution.
Conditions of Parallelism
Program Partitioning & Scheduling
(Grain Determination and Scheduling Optimization)
Includes 4 steps:
1. Construct a fine-grain program graph.
2. Schedule the fine-grain computation.
3. Perform grain packing to produce coarse grains.
4. Generate a parallel schedule based on the packed graph.
Conditions of Parallelism
Program Partitioning & Scheduling(Program Graphs and Packing)
A program graph is like a dependence graph:
• Nodes = {(n, s)}, where n is the node name and s the grain size
– The larger s is, the larger the grain.
• Edges = {(v, d)}, where v is the communicated variable and d the communication delay.
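A minimal sketch (made-up node sizes and delays, not from the slides) of such a graph
and of its longest computation-plus-communication path, which lower-bounds any schedule.

```python
from functools import lru_cache

nodes = {"A": 4, "B": 6, "C": 5, "D": 6}                # name -> grain size s
edges = {("A", "B"): ("x", 3), ("A", "C"): ("y", 3),    # (src, dst) -> (variable v, delay d)
         ("B", "D"): ("u", 2), ("C", "D"): ("w", 2)}

@lru_cache(maxsize=None)
def longest_path(node):
    succ = [(dst, d) for (src, dst), (_, d) in edges.items() if src == node]
    return nodes[node] + max((d + longest_path(dst) for dst, d in succ), default=0)

print(longest_path("A"))   # 21 time units along A -> B -> D (4 + 3 + 6 + 2 + 6)
```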
Conditions of Parallelism
Program Partitioning & Scheduling(Program Graphs and Packing)
– Each node in the program graph corresponds to a
computational unit in the program.
Program Flow Mechanisms
Control flow mechanism
Control flow mechanism(used by conventional computers):
– Conventional von Neumann computers use a program counter (PC) to sequence the
execution of the instructions in a program.
– Control-flow computers use shared memory to hold program instructions and data objects.
– Control flow can be made parallel by using parallel language constructs or a
parallelizing compiler.
– Control-flow machines give complete control, but are less efficient than other
approaches.
Program Flow Mechanisms
Data flow mechanism
Data-driven mechanism(used by data-flow computers):
• High potential for parallelism and throughput, and freedom from side effects.
• But they suffer from high control overhead, lose time waiting for unneeded arguments,
and have difficulty in manipulating data structures.
– Special mechanisms are required to detect data availability, match data tokens with
the instructions needing them, and enable the chain reaction of asynchronous instruction
execution.
Program Flow Mechanisms
Demand flow mechanism
Demand-driven mechanism (used by reduction computers):
• The demand for a result triggers the execution of the instructions that yield its
operands, and so forth; this is lazy evaluation, since an instruction is evaluated only
when its result is demanded.
• By contrast, the data-driven approach corresponds to eager evaluation, where
instructions fire as soon as their operands are available.
Program Flow Mechanisms
Demand flow mechanism
Demand-driven mechanism (used by reduction computers): 2 reduction models:
1. String-reduction model: each demanding node gets and reduces its own copy of the
expression.
2. Graph-reduction model: the expression is represented as a shared graph, and reduction
is performed directly on that graph.
System Interconnect Architectures
System Interconnect Architectures
Network Properties and Routing
– Static Networks:
• Point-to-Point direct connections which will not change during program
execution
– Dynamic Networks:
• Implemented with switched channels.
System Interconnect Architectures
Network Properties and Routing
Goals and Analysis:
– Analysis includes
• Latency
• Bisection bandwidth
• Data-routing functions
• Scalability of the parallel architecture
System Interconnect Architectures
Network Properties and Routing
Terminologies used in Network Properties and Routing:
– Diameter D of a network
• Is the maximum shortest path between any two nodes
– Wire (or channel) length: the length (e.g., weight) of the edges between nodes.
– Network is symmetric if the topology is the same looking from any node;
these are easier to implement or to program.
System Interconnect Architectures
Network Properties and Routing
Data Routing Functions: used for inter-PE data exchange
– Shifting
– Shuffle
– Exchange Etc.
System Interconnect Architectures
Network Properties and Routing
Data Routing Functions: Permutation (one-to-one)
– Given n objects, there are n! ways in which they can be reordered (one of which is no
reordering).
– For example, the permutation π = (a, b, c)(d, e) maps a→b, b→c, c→a in a circular
fashion, and d→e, e→d.
System Interconnect Architectures
Network Properties and Routing
Data Routing Functions: Perfect Shuffle and Exchange
– The perfect shuffle maps the node with binary address ab…k to the node bc…ka.
– That is, the address is cyclically shifted 1 bit to the left, with the leading bit
wrapping around to the least significant bit position.
– The inverse perfect shuffle (a cyclic right shift) reverses the effect of the perfect
shuffle.
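A minimal sketch (not from the slides) of both mappings on k-bit node addresses,
implemented as cyclic shifts of the address bits.

```python
def perfect_shuffle(addr, k):
    """Cyclically shift a k-bit address left by 1 (MSB wraps to the LSB position)."""
    msb = (addr >> (k - 1)) & 1
    return ((addr << 1) | msb) & ((1 << k) - 1)

def inverse_shuffle(addr, k):
    """Cyclically shift a k-bit address right by 1 (LSB wraps to the MSB position)."""
    lsb = addr & 1
    return (addr >> 1) | (lsb << (k - 1))

# k = 3: node 101 shuffles to 011, and the inverse shuffle maps it back.
print(format(perfect_shuffle(0b101, 3), '03b'))   # 011
print(format(inverse_shuffle(0b011, 3), '03b'))   # 101
```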
System Interconnect Architectures
Network Properties and Routing
Data Routing Functions:Hypercube Routing Functions
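A minimal sketch (an assumed example, not taken from the slides) of dimension-order
routing on an n-cube: each hop corrects one bit in which the current and destination
addresses differ.

```python
def hypercube_route(src, dst, n):
    """Return the sequence of node addresses visited from src to dst in an n-cube."""
    path, node = [src], src
    for dim in range(n):                   # scan dimensions 0 .. n-1
        if (node ^ dst) & (1 << dim):      # addresses differ in this dimension
            node ^= (1 << dim)             # traverse the link along that dimension
            path.append(node)
    return path

# 3-cube route from node 000 to node 101: two hops, one per differing bit.
print([format(v, '03b') for v in hypercube_route(0b000, 0b101, 3)])
# ['000', '001', '101']
```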
System Interconnect Architectures
Network Properties and Routing
Data Routing Functions:Broadcast and Multicast
System Interconnect Architectures
Network Properties and Routing
Data Routing Functions:Network Performance
– Functionality:-
• How the network supports data routing, interrupt handling, synchronization,
request/message combining, and coherence.
– Network latency:-
• Worst-case time for a unit message to be transferred.
– Bandwidth
• Maximum data rate
– Hardware complexity
• Implementation costs for wire, logic, switches, connectors, etc.
– Scalability
• How easily does the scheme adapt to an increasing number of processors,
memories, etc.?
System Interconnect Architectures
Network Properties and Routing
Static Connection Networks
– Static networks use direct links which are fixed once built. Examples discussed below
include the linear array, ring, chordal ring, barrel shifter, tree, star, fat tree, mesh,
torus, systolic array, hypercube, and cube-connected cycles.
System Interconnect Architectures
Network Properties and Routing
Static Connection Networks Linear Array
– Diameter = n-1
– Bisection = 1
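A small sketch of these parameters as functions of the node count n (the ring values
are standard results, not taken from the slides).

```python
def linear_array(n):
    return {"diameter": n - 1, "bisection_width": 1, "max_node_degree": 2}

def ring(n):
    return {"diameter": n // 2, "bisection_width": 2, "max_node_degree": 2}

print(linear_array(16))   # {'diameter': 15, 'bisection_width': 1, 'max_node_degree': 2}
print(ring(16))           # {'diameter': 8, 'bisection_width': 2, 'max_node_degree': 2}
```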
System Interconnect Architectures
Network Properties and Routing
Static Connection Networks Ring, Chordal Ring
System Interconnect Architectures
Network Properties and Routing
Static Connection Networks Barrel Shifter
System Interconnect Architectures
Network Properties and Routing
Static Connection Networks Tree and Star
– The balanced binary tree is scalable, since it has a constant maximum node
degree.
System Interconnect Architectures
Network Properties and Routing
Static Connection Networks Tree ,Star and Fat Tree
– A star is a two-level tree with a node degree d = N – 1 and a constant diameter
of 2.
– A fat tree is a tree in which the number of edges between nodes increases
closer to the root
– The edges represent communication channels (“wires”).
System Interconnect Architectures
Network Properties and Routing
Static Connection Networks Mesh and Torus
– Pure mesh: a k-dimensional mesh with N = n^k nodes has an interior node degree of 2k
and a diameter of k(n − 1); it is not symmetric, since boundary nodes have lower degree.
System Interconnect Architectures
Network Properties and Routing
Static Connection Networks Mesh and Torus
– A torus mesh :
• Has ring connections in each dimension, and is symmetric.
System Interconnect Architectures
Network Properties and Routing
Static Connection Networks Systolic Array
System Interconnect Architectures
Network Properties and Routing
Static Connection Networks Hypercubes
System Interconnect Architectures
Network Properties and Routing
Static Connection Networks Cube-connected Cycles
– Price can be attributed to the cost of wires, switches, arbiters, and connectors.
System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks Bus Systems
– The throughput of the network based on a bus can be increased by the use
of a multibus network shown in the figure below.
– In this network, processors connected to the busses can transmit data in
parallel (one for each bus) and many processors can read data from many
busses at a time.
System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks Switch Modules
System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks Switch Modules
– However, multiple inputs may not be connected to the same output. When
only one-to-one mappings are allowed, the switch is called a crossbar switch.
System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks Multistage Interconnection Networks
Single-stage networks:
• The perfect shuffle operation is a cyclic shift 1 place to the left, e.g. 101 → 011.
System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks Multistage Interconnection Networks
These networks can be classified into the following four categories:
– Nonblocking: a network is called strictly nonblocking if it can connect any idle input
to any idle output, regardless of which other connections are currently in progress.
System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks Multistage Interconnection Networks
Baseline Networks:
– A baseline network can be shown to be topologically equivalent to other
networks (including Omega), and has a simple recursive generation procedure.
– Stage k (k = 0, 1, …) is an m × m switch block (where m = N / 2^k) composed entirely
of 2 × 2 switch blocks, each having two configurations: straight through and crossover.
– The network can be generated recursively
– The first stage N × N, the second (N/2) × (N/2).
– Networks are topologically equivalent if one network can be easily reproduced
from the other networks by simply rearranging nodes at each stage
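A small sketch (not from the slides) of the sizes implied by this recursive construction
for an N-input network built from 2 × 2 switches.

```python
from math import log2

def multistage_size(N):
    stages = int(log2(N))             # log2(N) stages of 2x2 switches
    switches_per_stage = N // 2
    return stages, stages * switches_per_stage

print(multistage_size(16))   # (4, 32): 4 stages, 32 two-by-two switches in total
```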
System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks Multistage Interconnection Networks
Crossbar Networks:
– A crossbar switch has a number of input and output data pins and a number
of control pins.
– In response to control instructions set to its control input, the crossbar
switch implements a stable connection of a determined input with a
determined output.
– An m × n crossbar network
• With m processors and n memories, one processor may be able to generate
requests for multiple memories in sequence; thus several switches might be set
in the same row.
• For m × m interprocessor communication, each PE is connected to both an input
and an output of the crossbar; only one switch in each row and column can be
turned on simultaneously. Additional control processors are used to manage the
crossbar itself.
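A minimal sketch (not from the slides) of the row/column constraint just described: a
valid crossbar setting for interprocessor communication turns on at most one switch per
row and per column.

```python
def valid_crossbar_setting(switch_on):
    """switch_on: set of (row, column) pairs that are switched on."""
    rows = [r for r, _ in switch_on]
    cols = [c for _, c in switch_on]
    return len(rows) == len(set(rows)) and len(cols) == len(set(cols))

print(valid_crossbar_setting({(0, 2), (1, 0), (2, 1)}))   # True: a permutation
print(valid_crossbar_setting({(0, 2), (1, 2)}))           # False: column conflict
```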
System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks Multistage Interconnection Networks
Crossbar Networks:
– Each junction is a switching component – connecting the row to the column.
Principles of Scalable Performance
Chapter-3
This chapter includes:
– Performance measures
– Speedup laws
– Scalability principles
– Scaling up vs. scaling down
Principles of Scalable Performance
Chapter-3
Performance metrics and measures:-
– Parallelism profiles
Principles of Scalable Performance
Parallelism Profile in Programs
Parallelism Profile in Programs:
– The degree of parallelism (DOP) reflects the matching of software and hardware
parallelism.
– Related measures: average parallelism and available parallelism.
Principles of Scalable Performance
Parallelism Profile in Programs
Degree of parallelism(DOP): Factors affecting parallelism
profiles:
– Algorithm structure
– Program optimization
– Resource utilization
– Run-time conditions
• (The analysis below assumes an ideal situation with no overhead.)
Principles of Scalable Performance
Parallelism Profile in Programs
Average parallelism A is the DOP averaged over the observation period (t1, t2):
A = (1 / (t2 − t1)) × ∫ DOP(t) dt, integrated from t1 to t2.
In discrete form, if ti is the total time during which DOP = i and m is the maximum DOP,
we get
A = ( Σ i·ti ) / ( Σ ti ), with both sums taken over i = 1, …, m.
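A tiny sketch of the discrete form, using a made-up parallelism profile.

```python
# Hypothetical parallelism profile: profile[i] = total time during which DOP = i.
profile = {1: 4.0, 2: 3.0, 4: 2.0, 8: 1.0}

A = sum(i * t for i, t in profile.items()) / sum(profile.values())
print(A)   # (4 + 6 + 8 + 8) / 10 = 2.6
```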
Principles of Scalable Performance
Parallelism Profile in Programs
Example: Parallelism profile and average parallelism
Principles of Scalable Performance
Parallelism Profile in Programs
Asymptotic speedup: let Wi = i·Δ·ti be the amount of work executed with DOP = i, where
Δ is the computing capacity of a single processor.
With a single processor (i = 1, sequential execution), the response time for Wi is
ti(1) = Wi / Δ, so the total response time is T(1) = Σ Wi / Δ over i = 1, …, m.
Principles of Scalable Performance
Parallelism Profile in Programs
The asymptotic speedup S∞ is defined as the ratio of T(1) to T(∞), where
T(∞) = Σ Wi / (i·Δ) is the response time with an unlimited number of processors:
S∞ = T(1) / T(∞) = ( Σ Wi ) / ( Σ Wi / i )
This is the ideal case n = ∞; in practice it suffices that n >> m, the maximum DOP.
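Continuing the profile used above (same made-up values), the asymptotic speedup follows
directly from the work terms Wi.

```python
delta = 1.0                                            # capacity of one processor
profile = {1: 4.0, 2: 3.0, 4: 2.0, 8: 1.0}             # i -> ti
W = {i: i * delta * t for i, t in profile.items()}     # Wi = i * delta * ti

T1   = sum(W[i] / delta for i in W)                    # single-processor response time
Tinf = sum(W[i] / (i * delta) for i in W)              # unlimited-processor response time
print(T1, Tinf, T1 / Tinf)                             # 26.0 10.0 2.6
```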
Principles of Scalable Performance
Mean Performance
Performance measures:
– Consider n processors executing m programs in various modes
– Want to define the mean performance of these multimode
computers:
• Arithmetic mean performance
• Harmonic mean performance
Principles of Scalable Performance
Mean Performance
Arithmetic mean performance:
Let {Ri} be the execution rates of programs i = 1, 2, …, m.
The arithmetic mean execution rate is defined as:
Ra = (1/m) × Σ Ri, summed over i = 1, …, m.
Principles of Scalable Performance
Mean Performance
Harmonic mean performance: the harmonic mean execution rate across the m programs is
Rh = m / ( Σ (1/Ri) ), summed over i = 1, …, m.
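A small sketch (made-up rates) comparing the two means; the harmonic mean is dominated
by the slowest programs and never exceeds the arithmetic mean.

```python
rates = [10.0, 20.0, 40.0]                        # hypothetical execution rates Ri (MIPS)

Ra = sum(rates) / len(rates)                      # arithmetic mean execution rate
Rh = len(rates) / sum(1.0 / r for r in rates)     # harmonic mean execution rate

print(round(Ra, 2), round(Rh, 2))                 # 23.33 17.14
```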
Principles of Scalable Performance
Mean Performance
Harmonic mean Speedup:
Principles of Scalable Performance
Mean Performance-Amdahl’s Law
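A hedged sketch of Amdahl's law in the form commonly derived from the harmonic-mean
speedup with the two-mode weight distribution (α, 0, …, 0, 1 − α), where α is the
sequential fraction of the workload; treat the exact formulation as an assumption rather
than as this slide's own equation.

```python
# Amdahl's law (fixed workload): speedup on n processors with sequential fraction alpha.
def amdahl_speedup(n, alpha):
    return n / (1 + (n - 1) * alpha)

for n in (4, 16, 256, 1024):
    print(n, round(amdahl_speedup(n, alpha=0.05), 1))
# ~3.5, ~9.1, ~18.6, ~19.6 -> the speedup saturates near the bound 1/alpha = 20
```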
Principles of Scalable Performance
Efficiency, Utilization and Quality
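A hedged sketch of the metrics named in this heading, using definitions commonly paired
with them (efficiency E = S/n, redundancy R as the ratio of total operations to the
sequential operation count, utilization U = R·E, and quality Q = S·E/R); treat these
exact definitions, and the numbers below, as assumptions rather than as the slides' own
formulas.

```python
# Hypothetical measurements for a run on n processors.
n  = 16
T1 = 100.0      # sequential execution time
Tn = 10.0       # parallel execution time on n processors
O1 = 1.0e6      # operation count of the sequential run
On = 1.6e6      # total operation count of the parallel run

S = T1 / Tn     # speedup
E = S / n       # efficiency
R = On / O1     # redundancy
U = R * E       # utilization
Q = S * E / R   # quality of parallelism

print(S, E, R, U, round(Q, 3))   # 10.0 0.625 1.6 1.0 3.906
```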
KLIPS (kilo logic inferences per second) is used to indicate the reasoning power of an
AI machine.
Principles of Scalable Performance
Parallel Processing Applications
Few applications listed:-
– Drug design
– High-speed civil transport
– Ocean modelling
– Ozone depletion research
– Air pollution
– Digital anatomy
Parallel Processing Applications
– Here we discuss the grand challenges identified in the U.S. High Performance
Computing and Communications (HPCC) program.
Parallel Processing Applications
• Instruction-level parallelism
– Very few processors can execute more than two instructions in one cycle; the degree
attainable depends on the compiler, on OS limitations, and on the program flow built
into modern computers.
• Data parallelism
– Offers a higher degree of parallelism than instruction-level parallelism.
Parallel Processing Applications ( Application Models)
Application Models for Parallel Processing: 3 models, depending on whether the limiting
factor is memory capacity, latency tolerance, or I/O bandwidth.
The models are:
1. Fixed-load model:
• Corresponds to a constant workload (α).
Parallel Processing Applications
( Scalability of Parallel Algorithms)
Algorithmic characteristics:
• Computational granularity
• Parallelism profile
• Communication overhead, denoted h = h(s, n), which depends on the problem size s and
the machine size n
– Efficiency is then the achieved speedup divided by the number of processors used.
Speedup Performance Laws
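A hedged sketch of the fixed-time (Gustafson) speedup usually contrasted with Amdahl's
fixed-load law under this heading, with α again the sequential fraction; treat the exact
formulations as assumptions rather than as the slides' own equations.

```python
# Fixed-time (Gustafson) speedup: the workload scales with the machine size, so the
# parallel part grows while the sequential fraction alpha stays fixed.
def gustafson_speedup(n, alpha):
    return alpha + n * (1 - alpha)

def amdahl_speedup(n, alpha):          # fixed-load speedup, for comparison
    return n / (1 + (n - 1) * alpha)

n, alpha = 1024, 0.05
print(round(gustafson_speedup(n, alpha), 1))   # ~972.9: grows almost linearly with n
print(round(amdahl_speedup(n, alpha), 1))      # ~19.6: saturates near 1/alpha
```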
Scalability Analysis and Approaches
– (The performance here is attained by varying only the number of processors, n.)
Scalability Issues and Possible Solutions: