Multiprocessors and Thread-Level Parallelism
CONTENT
INTRODUCTION
BASICS OF SYNCHRONIZATION
A TAXONOMY OF PARALLEL
ARCHITECTURES
1. SISD (Single Instruction, Single Data)
2. SIMD (Single Instruction, Multiple Data)
3. MISD (Multiple Instruction, Single Data)
4. MIMD (Multiple Instruction, Multiple Data)
SISD
- Uniprocessor
SIMD
Same instruction is executed by multiple
processors using different data streams
Exploit data level parallelism
Each processor has its own data memory
Single instruction memory
Control processor to fetch and dispatch
instructions
MIMD
Each processor fetches its own instructions and
operates on its own data
Exploits thread level parallelism
Advantages:
1. Flexibility
2. Cost-performance: can be built from off-the-shelf processors
CLUSTERS
One class of MIMD
Use standard components and a network
technology
Two types:
Commodity clusters
Custom clusters
COMMODITY CLUSTERS
Rely on 3rd party processors and interconnect
technology
Are often blade / rack mounted servers
Focus on throughput
No communication among threads
Assembled by users rather than vendors
CUSTOM CLUSTERS
Designer customizes either the detailed node
design or the interconnect design or both
Exploit large amounts of parallelism
Require a significant amount of communication
during computation
More efficient
Ex.: IBM Blue Gene
MULTICORE
Multiple processors placed on a single die
A.k.a. on-chip multiprocessing or single-chip
multiprocessing
Multiple cores share resources (cache, I/O bus)
Ex.: IBM Power 5
PROCESS
Segment of code that may be run independently
Process state contains all necessary information
to execute that program
Each process is independent of the others: multiprogramming environment
THREADS
Multiple processors executing a single program
Share the code and address space
Grain size must be large to exploit parallelism
Independent threads within a process are
identified by the programmer or created by the
compiler
Loop iterations within a thread exploit data-level
parallelism
MIMD CLASSIFICATION
1. Centralized shared-memory architectures
2. Distributed-memory architectures
BENEFITS:
1.
Cost effective to scale memory bandwidth
2.
Reduces latency to access local memory
DRAWBACKS:
1.
Communicating data between processors becomes
more complex
2.
Software is needed to manage the increased
memory bandwidth
CHALLENGES OF MULTIPROCESSING
1. Limited parallelism available in programs
2. Relatively high latency of remote access
SOLUTION
Limited parallelism : algorithms with better
parallel performance
Access latency : architecture design and
programming
Reduce the frequency of remote accesses: hardware
and software mechanisms
Tolerate latency: multithreading and prefetching
PROBLEM
Suppose we want to achieve a speedup of 80 with 100
processors. What fraction of the original computation
can be sequential?
By Amdahl's Law: 80 = 1 / ((1 - Fparallel) + Fparallel/100),
which gives Fparallel = 0.9975
The parallel fraction must be 99.75%, i.e., at most
0.25% of the computation can be sequential
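The 99.75% figure can be checked numerically with Amdahl's Law; a small sketch (the function name is mine, not from the slides):

```c
/* Amdahl's Law: overall speedup when a fraction `fpar` of the work
   runs perfectly in parallel on `n` processors and the rest is serial. */
double amdahl_speedup(double fpar, int n) {
    return 1.0 / ((1.0 - fpar) + fpar / n);
}
```

With fpar = 0.9975 and n = 100 this evaluates to about 80.2, matching the target speedup of 80.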
SNOOPING PROTOCOLS
1.
Write invalidate: the writing processor invalidates
all other cached copies of the block before updating it
2.
Write update (write broadcast): the writing processor
broadcasts the new value to all caches holding the block
BASIC IMPLEMENTATION
TECHNIQUES
LIMITATIONS
As the number of processors in a multiprocessor
grows, or as each processor's memory demands grow,
any centralized resource becomes a bottleneck
A single bus has to carry both the coherence
traffic and the normal memory traffic
Designers can use multiple buses and
interconnection networks
Attain a midway approach: memory that is physically
distributed among the processors but logically shared
SYNCHRONIZATION
Synchronization mechanisms are built with user-
level software routines that rely on hardware-
supplied synchronization instructions
Atomic operations: The ability to atomically
read and modify the memory location
Atomic exchange: interchanges the value in a
register with a value in memory
Locks: 0 is used to indicate a lock is free; 1 is
used to indicate that a lock is unavailable
Simple implementation:
A processor could continually try to acquire the
lock using an atomic operation
E.g.: Exchange and test
To release a lock, the processor stores a 0 to the
lock
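A minimal sketch of this exchange-based lock using C11 atomics (the function names are mine):

```c
#include <stdatomic.h>

/* 0 = lock free, 1 = lock unavailable, as in the slide's convention. */
void acquire(atomic_int *lock) {
    /* Atomic exchange: swap a 1 into the lock and examine the old value.
       An old value of 0 means we grabbed a free lock; 1 means keep trying. */
    while (atomic_exchange(lock, 1) != 0)
        ;  /* spin */
}

void release(atomic_int *lock) {
    atomic_store(lock, 0);  /* store a 0 to free the lock */
}
```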
Coherence mechanism:
Use cache coherence mechanism to maintain the
lock value coherently
The processor can acquire a locally cached lock
rather than using a global memory
Locality in lock access: The processor that used
the lock last will use it again in near future
Spin procedure:
A processor reads the lock variable to test its
state
This is repeated until the value of the read
indicates that the lock is unlocked
The processor then races with all the other
waiting processors
All processors use a swap that reads the old value
and stores a 1 into the lock variable; the single
winner reads a 0, the rest read a 1 and resume spinning
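The spin procedure above is the classic test-and-test-and-set idiom; a sketch with C11 atomics, assuming the same 0/1 lock convention (function names are mine):

```c
#include <stdatomic.h>

void spin_lock(atomic_int *lock) {
    for (;;) {
        /* Test: spin on plain reads of the (locally cached) lock value,
           generating no bus traffic while the lock is held elsewhere. */
        while (atomic_load(lock) != 0)
            ;
        /* The lock looks free: race the other waiters with a swap.
           The single winner reads the old 0; the losers read a 1
           and go back to spinning on reads. */
        if (atomic_exchange(lock, 1) == 0)
            return;
    }
}

void spin_unlock(atomic_int *lock) {
    atomic_store(lock, 0);
}
```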
MODELS OF MEMORY
CONSISTENCY
Consistency:
1. When must a processor see a value that has
been updated by another processor?
2. In what order must a processor observe the
data writes of another processor?
Sequential consistency: the result of any execution
is the same as if the memory accesses executed
by each processor were kept in order and the
accesses among different processors were
arbitrarily interleaved
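A standard illustration of these two questions is the flag-passing idiom sketched below (with C11 sequentially consistent atomics, the consumer that sees `flag` set is guaranteed to also see the producer's write to `data`; the names are mine):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

int data = 0;         /* ordinary shared variable */
atomic_int flag = 0;  /* seq_cst atomic by default in C11 */

void *producer(void *arg) {
    (void)arg;
    data = 42;               /* write the data first */
    atomic_store(&flag, 1);  /* then publish it */
    return NULL;
}

void *consumer(void *arg) {
    while (atomic_load(&flag) == 0)
        ;                    /* wait until the write is published */
    *(int *)arg = data;      /* must observe data == 42 */
    return NULL;
}

int run_example(void) {
    pthread_t p, c;
    int seen = 0;
    pthread_create(&c, NULL, consumer, &seen);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return seen;
}
```

On a weaker memory model with plain (non-atomic) accesses, the consumer could leave the loop yet still read a stale `data`; sequential consistency rules that outcome out.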