ECE 552 Chapter 2 - The Basics: Natalie Enright Jerger
Recap
Evaluation Metrics: Cost, Power, Reliability, Performance
Amdahl's Law
Fall 2012 ECE 552 (Enright Jerger): Basics
[Figure: system layers: Application; OS; Compiler, Firmware; CPU, Memory, I/O; Digital Circuits; Gates and Transistors]
This lecture
Data Hazards
Hardware: stalling and bypassing
Instruction execution assumed atomic
Instruction X finishes before insn X+1 starts
[Figure: single-cycle datapath: PC, I$, register file (s1, s2, d), D$, control]
Datapath: functional units (ALUs), registers, memory interface
Execution (E)
  ALU performs arithmetic/logic operation (arithmetic insn)
  Compute memory address (load/store insn)
  If branch, compute target PC and update
Writeback (W)
  Write instruction result to register
Datapath: Multi-Cycle
[Figure: multi-cycle datapath: PC (+4), I$, IR, register file (s1, s2, d), latches A, B, O, D, D$, control]
Add latches to create multi-cycle implementation
Load is going to have longest path through datapath
[Figure: multi-cycle datapath, repeated for each step]
A = Regs[s1]
B = Regs[s2]
Execute (E)
  O = A + Imm32 (memory operations)
  O = A op B (Reg-Reg arithmetic)
  O = A op Imm32 (Reg-Imm arithmetic)
  Branch: ALU2 = NPC + Imm32; Cond = (A == 0)?; if (Cond) PC = ALU2
D = Mem[O]
Mem[O] = B
Writeback (W)
  Reg[d] = O (Reg-Reg arithmetic)
  Reg[d] = O (Reg-Imm arithmetic)
  Reg[d] = D (Load)
Quick Review
Single-cycle: insn0.fetch, dec, exec | insn1.fetch, dec, exec
Multi-cycle: insn0.fetch | insn0.dec | insn0.exec | insn1.fetch | insn1.dec | insn1.exec
Multi-cycle
  Branch: 20% (3 cycles), load: 20% (5 cycles), ALU: 60% (4 cycles)
  Clock period = 11 ns, CPI = (0.2*3 + 0.2*5 + 0.6*4) = 4
  Why 11 ns clock period and not 10 ns?
  Performance = 44 ns/insn
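The CPI arithmetic above can be checked with a short sketch. The instruction mix, cycle counts and clock period come from the slide; the names `MIX`, `CLOCK_NS` and `average_cpi` are made up for illustration.

```python
# Instruction mix from the slide: class -> (fraction of insns, cycles per insn).
MIX = {"branch": (0.20, 3), "load": (0.20, 5), "alu": (0.60, 4)}
CLOCK_NS = 11  # multi-cycle clock period from the slide


def average_cpi(mix):
    """Weighted average of cycles per instruction over the mix."""
    return sum(fraction * cycles for fraction, cycles in mix.values())


cpi = average_cpi(MIX)
ns_per_insn = cpi * CLOCK_NS
print(f"CPI = {cpi:.1f}, performance = {ns_per_insn:.0f} ns/insn")
```

Changing the mix or the per-class cycle counts shows how strongly the average depends on the most frequent instruction class.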
Pipelining
Multi-cycle: insn0.fetch | insn0.dec | insn0.exec | insn1.fetch | insn1.dec | insn1.exec
Pipelined:   insn0.fetch | insn0.dec   | insn0.exec
                           insn1.fetch | insn1.dec  | insn1.exec
Snapshot
[Pipeline diagram: cycles t through t+9 across; Insn i, i+1, i+2, i+3, ... down; each insn passes through F, D, X, M, W one cycle behind the previous one]
5 instructions in progress
[Figure: 5-stage pipelined datapath: PC, I$, register file (s1, s2, d), D$, with IR, A, B, O, D latches between stages]
Pipeline Terminology
[Figure: pipelined datapath with the pipeline latches labelled PC, F/D, D/X, X/M, M/W]
Pipeline Control
[Figure: pipelined datapath with a CTRL block; control signals xC, mC, wC are carried in the pipeline latches alongside IR]
Instruction Convention
Some ISAs (example: MIPS)
  Instruction destination (i.e. output) on the left
  add r1, r2, r3 means r1 <-- r2+r3
Other ISAs
  Instruction destination (i.e. output) on the right
  add r1, r2, r3 means r1+r2 --> r3
[Figure sequence: 3 instructions (add r1, r2 -> r3; lw [r5+0] -> r4; sw r6 -> [r7+4]) advancing one stage per cycle through the pipelined datapath, occupying the F/D, D/X, X/M and M/W latches in turn]
Pipeline Diagram
                  1  2  3  4  5  6  7  8  9
add r1,r2 -> r3   F  D  X  M  W
ld [r5] -> r4        F  D  X  M  W
st r6 -> [r7+4]         F  D  X  M  W
Pipeline diagram
  Cycles across, insns down
  Convention: X means ld [r5] -> r4 finishes execute stage and writes into X/M latch at end of cycle 4
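A pipeline diagram for a stall-free instruction sequence can be generated mechanically. The sketch below assumes the five stages F, D, X, M, W and one new instruction entering F per cycle; the helper name `pipeline_diagram` is made up for illustration.

```python
STAGES = ["F", "D", "X", "M", "W"]


def pipeline_diagram(insns):
    """Return one row per insn; column c holds the stage occupied in cycle c + 1."""
    total_cycles = len(insns) + len(STAGES) - 1
    rows = []
    for i, _ in enumerate(insns):
        row = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = stage  # insn i is in stage s during cycle i + s + 1
        rows.append(row)
    return rows


insns = ["add r1,r2 -> r3", "ld [r5] -> r4", "st r6 -> [r7+4]"]
rows = pipeline_diagram(insns)

# Print with cycles across and insns down, following the slide's convention.
print(" " * 18 + " ".join(f"{c:>2}" for c in range(1, len(rows[0]) + 1)))
for name, row in zip(insns, rows):
    print(f"{name:<18}" + " ".join(f"{cell:>2}" for cell in row))
```

The second row places X in cycle 4, which matches the convention stated on the slide for the ld instruction.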
Pipelining
Balanced
  All stages must take approximately the same time
  Doesn't make sense to optimize a stage whose processing time is not longest
Buffering
  Not all stages take exactly the same time
Independent computations
  No relationships between work units
  Minimize pipeline stalls
Single-cycle vs. Multi-cycle vs. Pipelined
Multi-cycle
  Branch 3 cycles, load 5 cycles, ALU 4 cycles
  Clock period = 11 ns, CPI = (0.2 * 3 + 0.2 * 5 + 0.6 * 4) = 4
  Performance = 44 ns/insn
Pipelined
  Clock period = 12 ns (approx. 50 ns / 5 stages + overheads)
  CPI = 1 (each insn takes 5 cycles, but 1 completes each cycle)
  Performance = 12 ns/insn
  Actually ... CPI = 1 + some penalty for pipelining
  Say CPI = 1.5 (on average an instruction completes every 1.5 cycles)
  Performance = 18 ns/insn
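The comparison boils down to time per instruction = clock period x CPI. A minimal sketch using only the numbers quoted on the slide (the helper `ns_per_insn` is a name chosen for this example):

```python
def ns_per_insn(clock_ns, cpi):
    """Average time per instruction: clock period times cycles per instruction."""
    return clock_ns * cpi


multi_cycle = ns_per_insn(11, 4.0)  # 44 ns/insn
ideal_pipe = ns_per_insn(12, 1.0)   # 12 ns/insn with no pipelining penalty
real_pipe = ns_per_insn(12, 1.5)    # 18 ns/insn with the slide's CPI of 1.5
speedup = multi_cycle / real_pipe   # pipelined vs. multi-cycle

print(f"multi-cycle {multi_cycle} ns, pipelined {real_pipe} ns, speedup {speedup:.2f}x")
```

Even with a slightly longer clock and a CPI of 1.5, the pipelined design in this example is more than twice as fast as the multi-cycle one.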
Managing a Pipeline
Proper flow requires two pipeline operations
  Mess with latch write-enable and clear signals to achieve
Operation I: stall
  Effect: stops some insns in their current stages
  Use: make younger insns wait for older ones to complete
  Implementation: de-assert write-enable
Operation II: flush
  Effect: removes insns from current stages
  Use: see later
  Implementation: assert clear signals
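The two operations can be modelled on a single pipeline latch. This is a behavioural sketch, not the hardware: the class `PipelineLatch`, the `NOP` marker, and the choice that clear takes priority over write-enable are all assumptions made for the example.

```python
NOP = "nop"


class PipelineLatch:
    """Holds the insn sitting between two stages."""

    def __init__(self):
        self.insn = NOP

    def clock(self, next_insn, write_enable=True, clear=False):
        if clear:
            self.insn = NOP        # flush: the insn is removed from its stage
        elif write_enable:
            self.insn = next_insn  # normal flow: the next insn moves in
        # otherwise: stall, the latch keeps its current insn


fd = PipelineLatch()
fd.clock("add")                     # add enters the latch
fd.clock("lw", write_enable=False)  # stall: add stays put, lw must wait
held = fd.insn
fd.clock("lw", clear=True)          # flush: latch now holds a nop
flushed = fd.insn
print(held, flushed)
```

Stalling a stage in practice means de-asserting write-enable on that latch and on every latch upstream of it, so that no younger insn overwrites a held one.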
Data Hazards/Dependence
Let's forget about branches and control for a while
3 insn seq from earlier example
  add r3, r2, r1
  lw r4, 0(r5)
  sw r6, 0(r7)
RAW
Read-after-write (RAW)
  add r2,r3 -> r1
  sub r1,r4 -> r2
  or r6,r3 -> r1
Problem: swap would mean sub uses wrong value for r1
True: value flows through this dependence
  Using different output register for add doesn't help
Dependent Operations
Independent operations
  add r1, r2 -> r3
  add r4, r5 -> r6
This one?
  add r1, r2 -> r3
  ld [r3] -> r4
  addi r3, 1 -> r6
  st r3 -> [r7]
[Figure: pipelined datapath (F/D, D/X, X/M, M/W) annotated "read r3, r5" and "compute r3 = r2 + r1"]
[Figure: pipelined datapath with st r3 -> [r7], addi r3, 1 -> r6, lw [r3] -> r4 and add r1,r2 -> r3 in flight]
Re-evaluated every cycle until no longer true
+ Low cost, simple
- IPC degradation, dependences are the common case
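A stall condition that is re-evaluated every cycle can be sketched as a comparison between the source registers of the insn in decode and the destination registers of older insns still in flight. The formulation below is one common version, not necessarily the one used in the lecture. It assumes no bypassing, and it assumes the register file is written before it is read within a cycle, so only the insns held in D/X and X/M need checking. The `Insn` record and `must_stall` helper are hypothetical.

```python
from typing import NamedTuple, Optional, Tuple


class Insn(NamedTuple):
    op: str
    srcs: Tuple[str, ...]
    dst: Optional[str]


def must_stall(fd: Optional[Insn], dx: Optional[Insn], xm: Optional[Insn]) -> bool:
    """True if the insn in F/D reads a register that an older insn has yet to write."""
    if fd is None:
        return False
    pending = {older.dst for older in (dx, xm) if older is not None and older.dst}
    return any(src in pending for src in fd.srcs)


add = Insn("add", ("r1", "r2"), "r3")
lw = Insn("lw", ("r3",), "r4")

print(must_stall(fd=lw, dx=add, xm=None))   # add in X: stall
print(must_stall(fd=lw, dx=None, xm=add))   # add in M: still stall
print(must_stall(fd=lw, dx=None, xm=None))  # add in W: lw may read r3
```

With these assumptions the dependent lw waits two cycles in decode.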
[Figure: hazard logic in ID comparing against the IR fields held in the D/X, X/M and M/W latches; accompanying pipeline diagram with stall cycles marked d* and p*]
[Figure: add r1, r2 -> r3 (IF ID EX MEM WB) followed by ld [r3] -> r4 (IF ID EX MEM WB); r3 is read in ID but only needed in EX]
+ Reduces stalls in a big way
- Additional wires and muxes may increase clock cycle
[Figure: bypass logic added to the pipelined datapath, with lw [r3+4] -> r4 in EX and add r1, r2 -> r3 in MEM; the figure is annotated "bypass" and "Why?"]
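Bypass logic amounts to a mux in front of each ALU input that prefers the youngest in-flight producer of the register being read. The sketch below is an illustrative assumption rather than the slide's circuit: `select_alu_input` is a made-up helper, the latches are modelled as `(dst_reg, value)` pairs, and loads in X/M (whose value is not yet available) are ignored.

```python
def select_alu_input(src_reg, regfile_value, xm, mw):
    """Pick the value for one ALU input.

    xm and mw are (dst_reg, value) pairs for the insns held in the X/M and
    M/W latches, or None if that latch holds no register-writing insn.
    """
    if xm is not None and xm[0] == src_reg:
        return xm[1], "MX"            # youngest producer: bypass from X/M
    if mw is not None and mw[0] == src_reg:
        return mw[1], "WX"            # older producer: bypass from M/W
    return regfile_value, "regfile"   # no hazard: use the value read in D


print(select_alu_input("r3", 0, ("r3", 7), ("r3", 5)))
print(select_alu_input("r3", 0, None, ("r3", 5)))
print(select_alu_input("r3", 9, ("r1", 7), None))
```

Checking X/M before M/W matters when both latches hold a writer of the same register: the X/M insn is later in program order, so its value is the one the consumer must see.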
Example: WM bypass
[Pipeline diagram: add r2,r3 -> r1 followed by an insn shown as "?"]
[Figure sequence: pipelined datapath with lw [r2+4] -> r3 followed by add r2, r3 -> r4, the latter stalled in F/D]
Load-Use Stalls
Even with full bypassing, stall logic is unavoidable
Load-use stall
  Load value not ready at beginning of M, can't use MX bypass
  Use WX bypass
                    1  2  3  4  5  6  7
lw [r2+4] -> r3     F  D  X  M  W
add r2,r3 -> r4        F  d* D  X  M  W
Aside I: how does stall/bypass logic handle cache misses?
Aside II: compiler scheduling can be used to reduce load-use stall frequency
Calculate CPI
CPI = 1 + (1 * 0.20 * 0.50) = 1.1
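The CPI formula has the shape base CPI + (stall cycles x fraction of insns that are loads x fraction of loads immediately followed by a dependent insn). Reading 0.20 and 0.50 as those two fractions is an assumption, since the premise of the exercise did not survive on the slide. The function name is made up for the example.

```python
def pipelined_cpi(stall_cycles, load_fraction, dependent_fraction, base_cpi=1.0):
    """Base CPI plus the average load-use stall cycles per instruction."""
    return base_cpi + stall_cycles * load_fraction * dependent_fraction


cpi = pipelined_cpi(stall_cycles=1, load_fraction=0.20, dependent_fraction=0.50)
print(f"CPI = {cpi:.1f}")
```

Compiler scheduling lowers `dependent_fraction`, which is why it reduces the load-use penalty without any hardware change.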
WAW Hazards
Write-after-write (WAW)
  add r2,r3 -> r1
  sub r1,r4 -> r2
  or r3,r6 -> r1
Compiler effects
  Scheduling problem: reordering would leave wrong value in r1
  Later instruction reading r1 would get wrong value
Pipeline effects
  Doesn't affect in-order pipeline with single-cycle operations
  One reason for making ALU operations go through M stage
WAR Hazards
Write-after-read (WAR)
  add r3,r2 -> r1
  sub r4,r5 -> r2
  or r1,r3 -> r6
Compiler effects
  Scheduling problem: reordering would mean add uses wrong value for r2
  Artificial: solve using different output register name for sub
Pipeline effects
  Can't happen in simple in-order pipeline
  Can happen with out-of-order execution
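The three dependence types can be told apart by comparing register names between an older and a younger instruction. A minimal sketch, where each insn is represented as a hypothetical `(sources, destination)` pair and the instructions are taken from the WAR and WAW examples:

```python
def dependences(older, younger):
    """Classify register dependences from `older` to `younger` (program order)."""
    srcs1, dst1 = older
    srcs2, dst2 = younger
    found = set()
    if dst1 in srcs2:
        found.add("RAW")  # younger reads what older writes
    if dst2 in srcs1:
        found.add("WAR")  # younger overwrites what older reads
    if dst1 == dst2:
        found.add("WAW")  # both write the same register
    return found


# WAR example: add r3,r2 -> r1 ; sub r4,r5 -> r2 ; or r1,r3 -> r6
add = (("r3", "r2"), "r1")
sub = (("r4", "r5"), "r2")
or_ = (("r1", "r3"), "r6")
print(dependences(add, sub))  # sub writes r2, which add reads
print(dependences(add, or_))  # or reads r1, which add writes

# WAW example: add r2,r3 -> r1 ; or r3,r6 -> r1
print(dependences((("r2", "r3"), "r1"), (("r3", "r6"), "r1")))
```

Only RAW carries a value from one insn to the other. WAR and WAW depend purely on register names, which is why renaming the output register removes them.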
Structural Hazards
  ld [r1] -> r2
  add r4,r3 -> r1
  sub r5,r3 -> r1
  and r3,r4 -> r6
[Pipeline diagram: cycles across, insns down; in cycle 4 the ld is in M while the and is due to be fetched]
s* = structural stall
Q: which one to stall: ld or and?
  Always safe to stall younger instruction (here and)
  But not always the best thing to do performance wise (?)
+ Low cost, simple
- Increases CPI
Upshot: better to avoid by design than to fix
Fetch stall logic: (X/M.op == ld || X/M.op == st)
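The fetch stall rule can be simulated to see where the s* cycle lands. The sketch assumes a single memory port shared by F and M, an in-order 5-stage pipeline in which an insn fetched in cycle c reaches M in cycle c + 3, and no other stalls. `fetch_cycles` is a name chosen for the example.

```python
def fetch_cycles(ops):
    """Return the cycle in which each insn is fetched, oldest first."""
    port_busy = set()  # cycles in which an older ld/st occupies the memory port in M
    cycles = []
    cycle = 1
    for op in ops:
        while cycle in port_busy:  # same test as X/M.op == ld || X/M.op == st
            cycle += 1             # s*: fetch waits one cycle
        cycles.append(cycle)
        if op in ("ld", "st"):
            port_busy.add(cycle + 3)  # this insn will be in M three cycles later
        cycle += 1
    return cycles


print(fetch_cycles(["ld", "add", "sub", "and"]))
```

For the slide's sequence the and instruction is fetched one cycle late, because the ld is using the memory port in cycle 4.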
Control Hazards
Pipeline works well when there is no transfer of control
  F fetches next sequential instruction
  Problem when sequential flow is disrupted
First, look at steps needed for branch: br (Rj op Rk) displ
  Comparison between Rj and Rk
  Set flag for outcome of comparison
  Compute target address: PC + displ (if necessary)
  Modification of PC (if necessary)
Add an ALU to ID stage to compute the target address
Control Hazards
[Pipeline diagram: branch instruction followed by sequentially fetched insns]
  Branch decision known at this stage
  If taken, 2 instructions are wrong
  PC is correct, fetch the right instruction
What to do?
Control Hazards
Default: assume not-taken (at fetch, can't tell it's a branch)
  Control hazards indicated with c* (or not at all)
  Taken branch penalty is 2 cycles
At decode, know it's a branch and stall
  Insert no-ops for 2 cycles
                   1  2  3  4  5  6  7  8  9
addi r1,1 -> r3    F  D  X  M  W
bnez r3,targ          F  D  X  M  W
st r6 -> [r7+4]          c* c* F  D  X  M  W
[Pipeline diagram: insns fetched right after the branch proceed down the pipeline and are marked speculative]
Speculate: predict branch outcome
  Simple prediction: predict not-taken
Mis-speculation recovery: what to do on wrong guess
  Not too painful in an in-order pipeline
  Branch resolves in X
  + Younger insns (in F, D) haven't changed permanent state
  Flush insns currently in F/D and D/X (i.e., replace with nops)
Mis-speculation recovery
[Pipeline diagram: Recovery: the insns fetched after the mispredicted branch are flushed (shown as --) and fetch restarts in cycle 5]
Branch Prediction
Simple strategy assumes branch not taken
  But... taken branches are more common
Want a predictive strategy
  Yield taken or not taken prediction with high probability of being right
Jump ahead to Chapter 4 (to prepare you for assignment)
Speculative execution
  Execute before all parameters known with certainty
Correct speculation
  + Avoid stall, improve performance
Mechanics
  Guess branch target, start fetching at guessed position
  Execute branch to verify (check) guess
  Correct speculation? Keep going
  Mis-speculation? Flush mis-speculated insns
  Don't write registers or memory until prediction verified
Control Speculation
Speculation game for in-order 5 stage pipeline
  Gain = 2 cycles
  Penalty = 0 cycles
  No penalty: mis-speculation no worse than stalling
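The speculation game can be written as cycles lost per branch under each policy. The 2-cycle figure comes from the slide; the function name and the 75% prediction accuracy used in the example are purely illustrative.

```python
BRANCH_RESOLVE_CYCLES = 2  # cycles between fetching a branch and knowing its outcome


def cycles_lost_per_branch(policy, accuracy=0.0):
    """Average cycles lost per branch for a stalling or speculating fetch unit."""
    if policy == "stall":
        return float(BRANCH_RESOLVE_CYCLES)  # always wait for the branch
    if policy == "speculate":
        # A correct guess loses nothing; a wrong guess flushes what a stall would have skipped.
        return (1.0 - accuracy) * BRANCH_RESOLVE_CYCLES
    raise ValueError(f"unknown policy: {policy}")


print(cycles_lost_per_branch("stall"))                     # 2.0
print(cycles_lost_per_branch("speculate", accuracy=0.0))   # 2.0, no worse than stalling
print(cycles_lost_per_branch("speculate", accuracy=0.75))  # 0.5
```

Because a wrong guess costs the same 2 cycles as a stall, any accuracy above zero makes speculation a net win in this pipeline.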
Event selection: branches, but can make predictions on all insns (ignore for non-branch)
Predictor indexing
Predictor mechanism: static vs. dynamic
Recovery?
Feedback: branch outcome
  Update prediction mechanism
  Update history
[Table: state/prediction vs. outcome]
Two built-in mis-predictions per inner loop iteration
Branch predictor changes its mind too quickly
Strong taken: last two instances were taken
Weak taken: last instance N but previous T
Weak not-taken: last instance T but previous N
[Table: state/prediction vs. outcome, starting from the initial state]
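The effect of the extra bit of hysteresis can be seen by running a 1-bit and a 2-bit saturating counter over the same branch. The outcome pattern (an inner-loop branch taken three times, then not taken, repeated ten times) and the initial state (strongest not-taken) are assumptions for this sketch. The function name is made up.

```python
def mispredictions(outcomes, bits):
    """Count wrong predictions made by one n-bit saturating counter."""
    max_state = (1 << bits) - 1
    threshold = 1 << (bits - 1)  # predict taken when state >= threshold
    state = 0                    # assumed initial state: strongest not-taken
    wrong = 0
    for taken in outcomes:
        if (state >= threshold) != taken:
            wrong += 1
        # Feedback: move one step toward the actual outcome, saturating at the ends.
        state = min(state + 1, max_state) if taken else max(state - 1, 0)
    return wrong


pattern = [True, True, True, False] * 10  # T T T N, ten times

print(mispredictions(pattern, bits=1))
print(mispredictions(pattern, bits=2))
```

After warm-up the 1-bit counter is wrong twice per pass through the loop (at the exit and again on re-entry), while the 2-bit counter is wrong only at the exit.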
Clearly branch 3 is correlated to the behaviour of branches 1 and 2
Predictor that uses single branch to predict outcome cannot capture this behaviour
Global scheme
  2^k saturating counters
Correlated Predictor
Correlated (two-level) predictor [Patt]
  Exploits observation that branch outcomes are correlated
  Maintains separate prediction per (PC, BHR)
  Branch history register (BHR): recent branch outcomes
Correlated Predictor
What happened?
  BHR wasn't long enough to capture the pattern
Try again: BHT + 3-bit BHR: 8 1-bit DIRP entries
[Table: state/prediction for BHR=NNN, NNT, NTN, NTT, TNN, TNT, TTN, TTT (active pattern highlighted) vs. outcome]
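Why the history length matters can be checked with a toy correlated predictor for a single branch: one 1-bit entry per BHR value. The T T T N outcome pattern, the all-N initial history, and the default prediction of not-taken for unseen histories are assumptions made for this sketch.

```python
def correlated_mispredictions(outcomes, history_bits):
    """One 1-bit prediction per BHR value, for a single branch."""
    history = (False,) * history_bits  # BHR, oldest outcome first, initially all N
    table = {}                         # BHR value -> last outcome seen with that history
    wrong = 0
    for taken in outcomes:
        if table.get(history, False) != taken:
            wrong += 1
        table[history] = taken               # update the 1-bit entry
        history = history[1:] + (taken,)     # shift the outcome into the BHR
    return wrong


pattern = [True, True, True, False] * 10  # T T T N, ten times

print(correlated_mispredictions(pattern, history_bits=2))
print(correlated_mispredictions(pattern, history_bits=3))
```

With 2 bits of history, the history TT is followed sometimes by T and sometimes by N, so the predictor keeps missing. With 3 bits every history in the steady-state pattern has a unique successor, and all mispredictions happen during warm-up.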
Correlated Predictor
Design choice I: one global BHR or one per PC (local)?
  Each one captures different kinds of patterns
  Global is better, captures local patterns for tight loop branches
Hybrid Predictor
Hybrid (tournament) predictor [McFarling]
  Attacks correlated predictor BHT utilization problem
  Idea: combine two predictors
    Simple PHT predicts history independent branches
    Correlated predictor predicts only branches that need history
    Chooser assigns branches to one predictor or the other
    Branches start in simple PHT, move to the correlated predictor past a mis-prediction threshold
  + Correlated predictor can be made smaller, handles fewer branches
  + 90-95% accuracy
[Figure: PC indexes a simple PHT and a chooser; the GHR indexes a second PHT; the chooser selects between the two predictions]
Handling Interrupts/Exceptions
Instructions before X (X-1, X-2...) in program order currently in pipeline
  Complete normally
  Results are part of saved process state
Instruction X and instructions after it already in the pipeline
  Converted into nops
Saved PC corresponds to the PC of instruction X
  Program will restart at X
Called precise state or precise interrupts
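The precise-interrupt rule can be sketched as a split of the in-flight instructions around the excepting instruction X. The representation of the pipeline as a list of `(pc, insn)` pairs in program order, the sample PCs, and the function name are assumptions made for this example.

```python
def take_precise_interrupt(in_flight, x):
    """in_flight: (pc, insn) pairs in program order, oldest first; x: index of insn X."""
    completed = in_flight[:x]                            # older insns complete normally
    squashed = [(pc, "nop") for pc, _ in in_flight[x:]]  # X and younger become nops
    saved_pc = in_flight[x][0]                           # program restarts at X
    return completed, squashed, saved_pc


pipeline = [(0x100, "add"), (0x104, "div"), (0x108, "lw"), (0x10C, "sub")]
completed, squashed, saved_pc = take_precise_interrupt(pipeline, x=1)
print(completed, squashed, hex(saved_pc))
```

After the handler returns, fetch restarts at `saved_pc`, so X and everything younger is re-executed.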
Handle /0 exception first
Will revisit exception handling for more complex pipelines later
Let's look at EX
Fast integer arithmetic and logic operations
  Single cycle
Slow integer arithmetic operations: multiply, divide
Floating point operations: add, multiply, divide, sqrt
  Pipelined (except div and sqrt)
[Figure: pipeline with multiple execution units ahead of MEM: multiplier stages (up to M7), adder stages A1 to A4, and div]
Long operations
  RAW stalls will be more frequent
  WAW hazards are possible: insns no longer reach WB in order
  WAR hazards are not possible: register reads always occur in D
[Pipeline diagram: insns using the E/ and E+ units, with d* stalls, including addf f2,f3 -> f4]
What to do?
Option I: stall younger instruction (addf) at writeback
  + Intuitive, simple
  - Lower performance, cascading W structural hazards
Control hazards
  Branches: flush instructions when branch taken
  Branch prediction