L 4 Multithreading


Why Multithreading Today?

 ILP is largely exhausted; the focus has shifted to TLP.
 Large performance gap between memory and processor.
 Abundant transistors on chip.
 More multithreaded (MT) applications exist today.
 Multiprocessors on a single chip.
 Long network latency, too.

CSE431 L28 CMP&SMT.1 Irwin, PSU, 2005


Contemporary forms of parallelism
 Instruction-level parallelism (ILP)
 Wide-issue superscalar processors (SS)
 4 or more instructions per cycle
 Executing a single program or thread
 Attempts to find multiple instructions to issue each cycle
 Thread-level parallelism (TLP)
 Fine-grained multithreaded superscalars (FGMS)
 Contain hardware state for several threads
 Executing multiple threads
 On any given cycle, the processor executes instructions from
one of the threads
 Multiprocessor (MP)
 Performance improved by adding more CPUs


Requirements of Multithreading

 Storage is needed to hold each context’s PC, registers, status word, etc.
 Coordination to match an event with a saved context
 A way to switch contexts
 Long-latency operations must use resources that are not otherwise in use,
so that other threads can keep running while they complete


Multithreading on A Chip
 Find a way to “hide” true data dependency stalls, cache
miss stalls, and branch stalls by finding instructions (from
other process threads) that are independent of those
stalling instructions
 Multithreading – increase the utilization of resources on a
chip by allowing multiple processes (threads) to share the
functional units of a single processor
 Processor must duplicate the state hardware for each thread – a
separate register file, PC, instruction buffer, and store buffer for
each thread
 The caches can be shared (although the miss rates may increase
if they are not sized accordingly)
 The memory can be shared through virtual memory mechanisms
 Hardware must support efficient thread context switching
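
The per-thread state listed above can be pictured as one small record per hardware thread. The following is a minimal sketch, not a description of any real processor: the class names, the 32-entry register file, and the switch_to method are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ThreadContext:
    """State duplicated for each hardware thread (hypothetical model)."""
    pc: int = 0
    # Assumption: 32 architectural registers per thread.
    registers: list = field(default_factory=lambda: [0] * 32)
    instruction_buffer: list = field(default_factory=list)
    store_buffer: list = field(default_factory=list)


class MultithreadedCore:
    """A core that keeps one context per thread and shares everything else."""

    def __init__(self, n_threads):
        self.contexts = [ThreadContext() for _ in range(n_threads)]
        self.active = 0

    def switch_to(self, thread_id):
        # A switch only selects another context; no state is copied,
        # which is what makes the switch cheap.
        self.active = thread_id
        return self.contexts[thread_id]


core = MultithreadedCore(4)
core.contexts[0].pc = 100
ctx = core.switch_to(2)
```

Because every thread owns its own PC and register file, switching to thread 2 leaves the state of thread 0 untouched.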


Types of Multithreading
 Fine-grain – switch threads on every instruction issue
 Round-robin thread interleaving (skipping stalled threads)
 Processor must be able to switch threads on every clock cycle
 Advantage – can hide throughput losses that come from both
short and long stalls
 Disadvantage – slows down the execution of an individual
thread since a thread that is ready to execute without stalls is
delayed by instructions from other threads
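
Round-robin interleaving with skipping of stalled threads can be sketched as a small scheduler. This is an illustrative model only: the function name and the stalled_until representation (the first cycle on which each thread is ready again) are assumptions, not part of the slides.

```python
def fine_grained_schedule(stalled_until, n_cycles):
    """Return the thread chosen on each cycle (None if every thread is stalled).

    stalled_until[t] is the first cycle on which thread t is ready again.
    """
    n = len(stalled_until)
    schedule = []
    next_thread = 0
    for cycle in range(n_cycles):
        chosen = None
        # Start at the thread after the last one issued and skip stalled ones.
        for offset in range(n):
            t = (next_thread + offset) % n
            if stalled_until[t] <= cycle:
                chosen = t
                break
        schedule.append(chosen)
        if chosen is not None:
            next_thread = (chosen + 1) % n
    return schedule


# Threads A..D are 0..3; thread A (0) is stalled until cycle 5.
print(fine_grained_schedule([5, 0, 0, 0], 8))
# prints [1, 2, 3, 1, 2, 3, 0, 1]
```

Thread A is skipped while it is stalled, yet no issue cycle is lost; the cost is that each ready thread issues only every third or fourth cycle.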


Fine Multithreading

[Figure: interleaved execution of Thread A, Thread B, Thread C, and Thread D; annotation: "Skip A"]


Types of Multithreading

 Coarse-grain – switches threads only on costly stalls (e.g., L2 cache misses)
 Advantages – thread switching doesn’t have to be essentially
free, and it is much less likely to slow down the execution of an
individual thread
 Disadvantage – limited, due to pipeline start-up costs, in its
ability to overcome throughput losses
- Pipeline must be flushed and refilled on thread switches
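
The trade-off above can be made concrete with a toy cycle-by-cycle model. This is a sketch under stated assumptions: each thread is a string of 'i' (a one-cycle instruction) and 'M' (an instruction that triggers a costly miss), and miss_latency and refill_penalty are hypothetical parameters, not figures from the slides.

```python
def coarse_grained_trace(threads, miss_latency, refill_penalty):
    """Return a per-cycle trace: a thread index, or None for an idle/refill cycle."""
    n = len(threads)
    pos = [0] * n        # next instruction index per thread
    ready_at = [0] * n   # first cycle on which each thread may run again
    trace = []
    current = 0
    bubble = 0           # refill cycles still owed after a switch
    cycle = 0

    def runnable(t):
        return pos[t] < len(threads[t]) and ready_at[t] <= cycle

    while any(pos[t] < len(threads[t]) for t in range(n)):
        if bubble == 0 and not runnable(current):
            # Switch only when the current thread cannot make progress.
            for offset in range(1, n):
                t = (current + offset) % n
                if runnable(t):
                    current = t
                    bubble = refill_penalty  # pipeline flush and refill
                    break
        if bubble > 0:
            trace.append(None)
            bubble -= 1
        elif runnable(current):
            if threads[current][pos[current]] == 'M':
                ready_at[current] = cycle + 1 + miss_latency
            trace.append(current)
            pos[current] += 1
        else:
            trace.append(None)  # every thread is stalled or finished
        cycle += 1
    return trace


print(coarse_grained_trace(["iMi", "ii"], miss_latency=4, refill_penalty=1))
# prints [0, 0, None, 1, 1, None, None, 0]
```

With refill_penalty=0 the same workload finishes one cycle sooner, which shows how the start-up cost eats into the cycles recovered by switching.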


Conceptual Diagram
(Similar to Fig. 7.5 in the text)

[Figure: Thread A, Thread B, Thread C, and Thread D]


Coarse Multithreading

Assume a long stall at the end of each thread segment shown on the previous slide.
Stalls for A and C would be longer than indicated on the previous slide.


SIMULTANEOUS MULTITHREADING


Simultaneous Multithreading
 Key idea
 Issue multiple instructions from multiple threads each cycle
 Features
 Fully exploits thread-level parallelism and instruction-level parallelism
 Better performance for
 A mix of independent programs
 Programs that are parallelizable
 A single-threaded program
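
The key idea above, filling the issue slots of one cycle from several threads, can be sketched as follows. This is an illustrative model: the function name, the ready_counts input (how many independent instructions each thread has ready), and the round-robin slot assignment are assumptions, not the selection logic of a real SMT core.

```python
def smt_issue(ready_counts, issue_width):
    """Fill up to issue_width slots in one cycle from several threads.

    ready_counts[t] is the number of independent instructions thread t
    has ready this cycle. Returns one thread id per filled slot.
    """
    slots = []
    remaining = list(ready_counts)
    # Hand out slots one at a time, cycling over threads that still have work.
    while len(slots) < issue_width and any(remaining):
        for t, left in enumerate(remaining):
            if left > 0 and len(slots) < issue_width:
                slots.append(t)
                remaining[t] -= 1
    return slots


print(smt_issue([1, 2, 0, 3], issue_width=4))  # prints [0, 1, 3, 1]
print(smt_issue([1], issue_width=4))           # prints [0]
```

A single thread with one ready instruction fills one of four slots, while four threads together fill all of them even though no single thread could.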


[Figure: issue slots versus time (processor cycles) for a superscalar (SS), fine-grained multithreading (FGMT), and SMT. Legend: unutilized slots; Thread 1 through Thread 5.]


Multiprocessor vs. SMT
[Figure: Multiprocessor (MP2) versus SMT over time (processor cycles). Legend: unutilized; Thread 1; Thread 2.]


SMT Architecture(1)
 Base processor: similar to an out-of-order superscalar
processor [MIPS R10000].
 Changes: with N simultaneously running threads, the processor
needs N PCs, N subroutine return stacks, and
more than N*32 physical registers in total for register
renaming.
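
The state count above is simple arithmetic: N threads times 32 architectural registers, plus extra registers for renaming. The sketch below assumes a hypothetical rename_registers parameter; the value 100 in the usage line is illustrative only and is not taken from the slides.

```python
def smt_state_requirements(n_threads, rename_registers, arch_registers=32):
    """Count the per-thread state an SMT core needs (illustrative model)."""
    return {
        "program_counters": n_threads,
        "return_stacks": n_threads,
        # Every thread needs its architectural registers mapped at all times;
        # the renaming pool comes on top of that.
        "physical_registers": n_threads * arch_registers + rename_registers,
    }


needs = smt_state_requirements(n_threads=8, rename_registers=100)
print(needs["physical_registers"])  # prints 356
```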


SMT Architecture(2)

 The larger register file has a longer access time, so pipeline
stages are added. [Register reads and writes each take 2
stages.]
Fetch -> Decode -> Renaming -> Queue -> Reg Read -> Reg Read -> Exec -> Reg Write -> Commit

 Threads share the cache hierarchy and branch prediction hardware.
 Each cycle: select up to 2 threads and fetch up to 4 instructions
from each (the 2.4 scheme).
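
The 2.4 fetch scheme can be sketched as a per-cycle selection step. This is a minimal sketch: the function name, the fetch_queues structure, and the priority_order argument are assumptions, since the slide does not say how the two threads are chosen.

```python
def fetch_2_4(fetch_queues, priority_order, max_threads=2, max_per_thread=4):
    """Fetch from up to max_threads threads, up to max_per_thread instructions each.

    fetch_queues[t] is the list of pending instructions for thread t.
    priority_order lists thread ids in the order they are considered
    (the selection policy is an assumption of this sketch).
    Returns the (thread_id, instruction) pairs fetched this cycle.
    """
    fetched = []
    threads_used = 0
    for t in priority_order:
        if threads_used == max_threads:
            break
        queue = fetch_queues[t]
        if not queue:
            continue  # nothing to fetch from this thread this cycle
        batch = queue[:max_per_thread]
        del queue[:max_per_thread]
        fetched.extend((t, ins) for ins in batch)
        threads_used += 1
    return fetched


queues = {0: ["a1", "a2", "a3", "a4", "a5"], 1: [], 2: ["c1", "c2"], 3: ["d1"]}
fetched = fetch_2_4(queues, priority_order=[0, 1, 2, 3])
```

Here thread 0 supplies four instructions, thread 1 is skipped because it has nothing to fetch, and thread 2 supplies two; thread 3 waits for a later cycle.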


Simultaneous Multithreading
[Figure: Thread A, Thread B, Thread C, and Thread D; annotations: "Skip C", "Skip A"]


Simultaneous Multithreading (SMT)
 The name SMT was first used by UW; earlier versions came from UCSB [Nemirovsky, HICSS ’91] and [Hirata et
al., ISCA ’92]
 Intel’s HyperThreading (2-way SMT)
 IBM Power7 (4/6/8 cores, 4-way SMT); IBM Power5/6 (2 cores, each 2-way SMT, 4
chips per package); Power5 has OoO cores, Power6 has in-order cores
 Basic ideas: conventional MT + simultaneous issue + sharing of common resources

[Figure: SMT block diagram. Front end: per-thread PCs, I-CACHE, fetch unit, decode, and register rename. Back end: RS & ROB plus physical register file, per-thread register files, and functional units: FDiv (unpipelined, 16 cycles), FMult (4 cycles), FAdd (2 cycles), ALU1, ALU2, and Load/Store (variable latency) with D-CACHE.]

SMT Pipeline

Fetch -> Decode/Map -> Queue -> Reg Read -> Execute -> Dcache/Store Buffer -> Reg Write -> Retire

[Figure: resources used along the pipeline: PC, Icache, Register Map, Regs, Dcache, Regs]
