Multiprocessors and Thread-Level Parallelism
CONTENT
INTRODUCTION
BASICS OF SYNCHRONIZATION
A TAXONOMY OF PARALLEL
ARCHITECTURES
1. SISD (Single Instruction, Single Data)
2. SIMD (Single Instruction, Multiple Data)
3. MISD (Multiple Instruction, Single Data)
4. MIMD (Multiple Instruction, Multiple Data)
SISD
- Uniprocessor
SIMD
Same instruction is executed by multiple
processors using different data streams
Exploit data level parallelism
Each processor has its own data memory
Single instruction memory
Control processor to fetch and dispatch
instructions
MIMD
Each processor fetches its own instructions and
operates on its own data
Exploits thread level parallelism
Advantages:
1. Flexibility
2. Cost-performance: can be built from off-the-shelf processors
CLUSTERS
One class of MIMD
Use standard components and a network
technology
Two types:
Commodity clusters
Custom clusters
COMMODITY CLUSTERS
Rely on 3rd party processors and interconnect
technology
Are often blade / rack mounted servers
Focus on throughput
No communication among threads
Assembled by users rather than vendors
CUSTOM CLUSTERS
Designer customizes either the detailed node
design or the interconnect design or both
Exploit large amounts of parallelism
Require a significant amount of communication
during computation
More efficient
Ex.: IBM Blue Gene
MULTICORE
Multiple processors placed on a single die
A.k.a. on-chip multiprocessing or single-chip
multiprocessing
Multiple cores share resources (cache, I/O bus)
Ex.: IBM Power 5
PROCESS
Segment of code that may be run independently
Process state contains all necessary information
to execute that program
Each process is independent of the others: multiprogramming environment
THREADS
Multiple processors executing a single program
Share the code and address space
Grain size must be large to exploit parallelism
Independent threads within a process are
identified by the programmer or created by the
compiler
Loop iterations within a thread exploit data-level
parallelism
MIMD CLASSIFICATION
1. Centralized shared-memory architectures
2. Distributed-memory architectures
BENEFITS:
1.
Cost effective to scale memory bandwidth
2.
Reduces latency to access local memory
DRAWBACKS:
1.
Communicating data between processors becomes
more complex
2.
Software is needed to manage the increased
memory bandwidth
CHALLENGES OF MULTIPROCESSING
1. Limited parallelism available in programs
2. Relatively high latency of remote access
SOLUTION
Limited parallelism : algorithms with better
parallel performance
Access latency : architecture design and
programming
Reduce the frequency of remote accesses: hardware
and software mechanisms
Tolerate latency: multithreading and prefetching
PROBLEM
Suppose we want to achieve a speedup of 80 with 100
processors. What fraction of the original computation
can be sequential?
By Amdahl's Law: 80 = 1 / ((1 - Fparallel) + Fparallel/100),
which gives Fparallel = 0.9975
The parallel fraction must be 99.75%, i.e., at most
0.25% of the computation can be sequential
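The 99.75% figure can be checked numerically with Amdahl's Law; a small sketch (the function name is mine, not from the slides):

```c
/* Amdahl's Law: overall speedup when a fraction `fpar` of the work
   runs perfectly in parallel on `n` processors and the rest is serial. */
double amdahl_speedup(double fpar, int n) {
    return 1.0 / ((1.0 - fpar) + fpar / n);
}
```

With fpar = 0.9975 and n = 100 this evaluates to about 80.2, matching the target speedup of 80.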
SNOOPING PROTOCOLS
1.
Write invalidate: the writing processor invalidates
all other cached copies of the block before updating it
2.
Write update (write broadcast): the writing processor
broadcasts the new value to all caches holding the block
BASIC IMPLEMENTATION
TECHNIQUES
LIMITATIONS
As the number of processors in a multiprocessor
grows, or as each processor's memory demands grow,
any centralized resource becomes a bottleneck
A single bus has to carry both the coherence
traffic and the normal memory traffic
Designers can use multiple buses and
interconnection networks
Attain a midway approach: memory that is physically
distributed among the processors but logically shared
SYNCHRONIZATION
Synchronization mechanisms are built with user-
level software routines that rely on hardware-
supplied synchronization instructions
Atomic operations: The ability to atomically
read and modify the memory location
Atomic exchange: interchanges the value in a
register with a value in memory
Locks: 0 is used to indicate a lock is free; 1 is
used to indicate that a lock is unavailable
Simple implementation:
A processor could continually try to acquire the
lock using an atomic operation
E.g.: Exchange and test
To release a lock, the processor stores a 0 to the
lock
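A minimal sketch of this exchange-based lock using C11 atomics (the function names are mine):

```c
#include <stdatomic.h>

/* 0 = lock free, 1 = lock unavailable, as in the slide's convention. */
void acquire(atomic_int *lock) {
    /* Atomic exchange: swap a 1 into the lock and examine the old value.
       An old value of 0 means we grabbed a free lock; 1 means keep trying. */
    while (atomic_exchange(lock, 1) != 0)
        ;  /* spin */
}

void release(atomic_int *lock) {
    atomic_store(lock, 0);  /* store a 0 to free the lock */
}
```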
Coherence mechanism:
Use cache coherence mechanism to maintain the
lock value coherently
The processor can acquire a locally cached lock
rather than using a global memory
Locality in lock access: The processor that used
the lock last will use it again in near future
Spin procedure:
A processor reads the lock variable to test its
state
This is repeated until the value of the read
indicates that the lock is unlocked
The processor then races with all the other
waiting processors
All processors use a swap that reads the old value
and stores a 1 into the lock variable; the single
winner reads a 0, the rest read a 1 and resume spinning
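The spin procedure above is the classic test-and-test-and-set idiom; a sketch with C11 atomics, assuming the same 0/1 lock convention (function names are mine):

```c
#include <stdatomic.h>

void spin_lock(atomic_int *lock) {
    for (;;) {
        /* Test: spin on plain reads of the (locally cached) lock value,
           generating no bus traffic while the lock is held elsewhere. */
        while (atomic_load(lock) != 0)
            ;
        /* The lock looks free: race the other waiters with a swap.
           The single winner reads the old 0; the losers read a 1
           and go back to spinning on reads. */
        if (atomic_exchange(lock, 1) == 0)
            return;
    }
}

void spin_unlock(atomic_int *lock) {
    atomic_store(lock, 0);
}
```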
MODELS OF MEMORY
CONSISTENCY
Consistency:
1. When must a processor see a value that has
been updated by another processor?
2. In what order must a processor observe the
data writes of another processor?
Sequential consistency: the result of any execution
is the same as if the memory accesses executed
by each processor were kept in order and the
accesses among different processors were
arbitrarily interleaved
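A standard illustration of these two questions is the flag-passing idiom sketched below (with C11 sequentially consistent atomics, the consumer that sees `flag` set is guaranteed to also see the producer's write to `data`; the names are mine):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

int data = 0;         /* ordinary shared variable */
atomic_int flag = 0;  /* seq_cst atomic by default in C11 */

void *producer(void *arg) {
    (void)arg;
    data = 42;               /* write the data first */
    atomic_store(&flag, 1);  /* then publish it */
    return NULL;
}

void *consumer(void *arg) {
    while (atomic_load(&flag) == 0)
        ;                    /* wait until the write is published */
    *(int *)arg = data;      /* must observe data == 42 */
    return NULL;
}

int run_example(void) {
    pthread_t p, c;
    int seen = 0;
    pthread_create(&c, NULL, consumer, &seen);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return seen;
}
```

On a weaker memory model with plain (non-atomic) accesses, the consumer could leave the loop yet still read a stale `data`; sequential consistency rules that outcome out.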