Here are the key steps in designing a pipelined processor:
1. Identify the stages in the instruction execution pipeline, such as fetch, decode, execute, memory, writeback.
2. Associate functional units and resources with each pipeline stage. For example, register file access in execute stage.
3. Examine the datapath and control signals to ensure data and resource dependencies flow correctly through the pipeline with no conflicts.
4. Add pipeline registers between stages to break up instruction flow into discrete packets and enable overlapped execution.
5. Design pipeline control logic to assert appropriate control signals in each stage, such as read/write enables for register file.
6. Handle hazards and exceptions so that instruction flow stays correct, e.g., by stalling, forwarding results, or flushing the pipeline.
Pipeline
1. CS 211: Computer Architecture
Instructor: Prof. Bhagi Narahari
Dept. of Computer Science
Course URL:
www.seas.gwu.edu/~narahari/cs211/
2. How to improve performance?
• Recall performance is function of
– CPI: cycles per instruction
– Clock cycle
– Instruction count
• Reducing any of the 3 factors will lead to
improved performance
3. How to improve performance?
• First step is to apply concept of pipelining to
the instruction execution process
– Overlap computations
• What does this do?
– Decrease clock cycle
– Decrease effective CPU time compared to original
clock cycle
• Appendix A of Textbook
– Also parts of Chapter 2
4. Pipeline Approach to Improve
System Performance
• Analogous to fluid flow in pipelines and
assembly line in factories
• Divide process into “stages” and send tasks
into a pipeline
– Overlap computations of different tasks by
operating on them concurrently in different stages
5. Instruction Pipeline
• Instruction execution process lends itself
naturally to pipelining
– overlap the subtasks of instruction fetch, decode
and execute
6. Linear Pipeline Processor
A linear pipeline processes a sequence of subtasks with linear precedence.
At a higher level: a sequence of processors, with data flowing in streams from stage S1 to the final stage Sk.
Control of data flow: synchronous or asynchronous.
S1 → S2 → … → Sk
7. Synchronous Pipeline
All transfers are simultaneous.
One task or operation enters the pipeline per cycle.
The processors' reservation table is diagonal.
8. Time-Space Utilization of Pipeline
The pipeline is full after 4 cycles:

Stage | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4
S3    |         |         |   T1    |   T2
S2    |         |   T1    |   T2    |   T3
S1    |   T1    |   T2    |   T3    |   T4

Time (in pipeline cycles)
9. Asynchronous Pipeline
Transfers are performed when individual processors are ready.
Handshaking protocol between processors.
Mainly used in multiprocessor systems with message passing.
10. Pipeline Clock and Timing
Stages Si and Si+1 are separated by a latch.
Clock cycle of the pipeline: τ
Stage delay: τ_m (for stage m); latch delay: d
τ = max{τ_m} + d
Pipeline frequency: f = 1/τ
11. Speedup and Efficiency
A k-stage pipeline processes n tasks in k + (n-1) clock cycles: k cycles for the first task and n-1 cycles for the remaining n-1 tasks.
Total time to process n tasks:
T_k = [k + (n-1)] τ
For the non-pipelined processor:
T_1 = n k τ
Speedup factor:
S_k = T_1 / T_k = n k τ / ([k + (n-1)] τ) = n k / (k + (n-1))
12. Efficiency and Throughput
Efficiency of the k-stage pipeline:
E_k = S_k / k = n / (k + (n-1))
Pipeline throughput (the number of tasks per unit time); note the equivalence to IPC:
H_k = n / ([k + (n-1)] τ) = n f / (k + (n-1))
13. Pipeline Performance: Example
• Task has 4 subtasks with times t1=60, t2=50, t3=90, and t4=80 ns (nanoseconds)
• Latch delay = 10 ns
• Pipeline cycle time = 90 + 10 = 100 ns
• For non-pipelined execution
  – time = 60+50+90+80 = 280 ns
• Speedup for the above case: 280/100 = 2.8 !!
• Pipeline time for 1000 tasks = (1000 + 4 - 1) cycles = 1003 × 100 ns
• Sequential time = 1000 × 280 ns
• Throughput = 1000/1003 tasks per cycle
• What is the problem here?
• How do we improve performance?
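As a sanity check, here is a small Python sketch (helper names are ours, not from the slides) that plugs the example's numbers into the formulas from the previous slides; note that the example computes the sequential time from the sum of the subtask times (280 ns), not from n·k·τ:

```python
# Sketch: verify the pipeline-performance example with the formulas above.

def pipeline_metrics(stage_delays_ns, latch_delay_ns, n_tasks):
    k = len(stage_delays_ns)                     # number of stages
    tau = max(stage_delays_ns) + latch_delay_ns  # cycle time = max stage delay + latch
    t_pipe = (k + (n_tasks - 1)) * tau           # Tk = [k + (n-1)] * tau
    t_seq = n_tasks * sum(stage_delays_ns)       # sequential: n * (sum of subtask times)
    return tau, t_pipe, t_seq

tau, t_pipe, t_seq = pipeline_metrics([60, 50, 90, 80], 10, 1000)
print(tau)                # 100 ns cycle time
print(t_pipe)             # (4 + 999) * 100 = 100,300 ns
print(t_seq)              # 1000 * 280 = 280,000 ns
print(t_seq / t_pipe)     # speedup ~ 2.79 for 1000 tasks
print(1000 / (4 + 999))   # throughput ~ 0.997 tasks per cycle
```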
14. Non-linear pipelines and pipeline control algorithms
• Can have non-linear paths in the pipeline…
  – How to schedule instructions so they do not conflict for resources
• How does one control the pipeline at the microarchitecture level?
  – How do we build a scheduler in hardware?
  – How much time does the scheduler have to make a decision?
15. Non-linear Dynamic Pipelines
• Multiple processors (k-stages) as linear pipeline
• Variable functions of individual processors
• Functions may be dynamically assigned
• Feedforward and feedback connections
16. Reservation Tables
• Reservation table: displays the time-space flow of data through the pipeline; it is analogous to the opcode of the pipeline
  – Not diagonal, as in linear pipelines
• Multiple reservation tables for different functions
  – Functions may be dynamically assigned
• Feedforward and feedback connections
• The number of columns in the reservation table gives the evaluation time of a given function
18. Latency Analysis
• Latency : the number of clock cycles
between two initiations of the pipeline
• Collision : an attempt by two initiations to use
the same pipeline stage at the same time
• Some latencies cause collision, some not
20. Latency Cycle
[Reservation-table diagram: three stage rows over 18 cycles showing repeated initiations x1, x2, x3]
Latency cycle: a sequence of initiations that has a repetitive subsequence and no collisions
Latency sequence length: the number of time intervals within the cycle
Average latency: the sum of all latencies divided by the number of latencies along the cycle
21. Collision-Free Scheduling
Goal: find the shortest average latency.
Lengths: for a reservation table with n columns, the maximum forbidden latency is m ≤ n - 1, and a permissible latency p satisfies 1 ≤ p ≤ m - 1.
Ideal case: p = 1 (static pipeline).
Collision vector: C = (Cm Cm-1 . . . C2 C1)
  [ Ci = 1 if latency i causes a collision ]
  [ Ci = 0 for permissible latencies ]
22. Collision Vector
Reservation Table
[Figure: a reservation table with X marks for value X1 and value X2; overlapping the initiations x1 and x2 shows which latencies cause a stage to be claimed twice]
C = (? ? . . . ? ?)
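The slide leaves the collision vector as an exercise. A small Python sketch of how it could be derived from a reservation table; the 3-stage table in the code is a made-up example, not the one pictured on the slide:

```python
# Sketch: derive forbidden latencies and the collision vector from a
# reservation table (rows = stages, columns = cycles; 1 = stage in use).

table = [
    [1, 0, 0, 1],   # stage S1 used in cycles 0 and 3
    [0, 1, 1, 0],   # stage S2 used in cycles 1 and 2
    [0, 0, 1, 1],   # stage S3 used in cycles 2 and 3
]

def collision_vector(table):
    n = len(table[0])                 # number of columns
    forbidden = set()
    for row in table:
        used = [c for c, v in enumerate(row) if v]
        # two initiations p cycles apart collide iff some row
        # has two marks exactly p columns apart
        for i in used:
            for j in used:
                if j > i:
                    forbidden.add(j - i)
    m = n - 1                         # maximum possible forbidden latency
    bits = [1 if p in forbidden else 0 for p in range(m, 0, -1)]
    return forbidden, bits            # C = (Cm ... C2 C1)

forbidden, C = collision_vector(table)
print(sorted(forbidden))   # [1, 3]: latencies that cause collisions
print(C)                   # [1, 0, 1], i.e. C = (101)
```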
23. Back to our focus: Computer
Pipelines
• Execute billions of instructions, so
throughput is what matters
• MIPS desirable features:
– all instructions same length,
– registers located in same place in instruction
format,
– memory operands only in loads or stores
24. Designing a Pipelined Processor
• Go back and examine your datapath and control diagram
• Associate resources with stages
• Ensure that flows do not conflict, or figure out how to resolve conflicts
• Assert control in the appropriate stage
25. 5 Steps of MIPS Datapath
What do we need to do to pipeline the process?
[Single-cycle datapath figure, stages Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, Write Back: PC-select mux and adder (Next PC / Next SEQ PC), instruction memory, register file (RS1, RS2, RD), sign-extended immediate, ALU with Zero? output, data memory, and the WB-data mux]
26. 5 Steps of MIPS/DLX Datapath
[Pipelined datapath figure: the same five stages (Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, Write Back) separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers; the destination register RD travels down the pipe with the instruction]
• Data stationary control
  – local decode for each instruction phase / pipeline stage
27. Graphically Representing
Pipelines
• Can help with answering questions like:
– how many cycles does it take to execute this code?
– what is the ALU doing during cycle 4?
– use this representation to help understand datapaths
28. Visualizing Pipelining
Time (clock cycles): Cycle 1 … Cycle 7
[Figure: four instructions in program order, each flowing through Ifetch, Reg, ALU, DMem, Reg in successive cycles, with one new instruction starting per cycle]
29. Conventional Pipelined Execution Representation
Time →
IFetch Dcd Exec Mem WB
   IFetch Dcd Exec Mem WB
      IFetch Dcd Exec Mem WB
         IFetch Dcd Exec Mem WB
            IFetch Dcd Exec Mem WB
               IFetch Dcd Exec Mem WB
Program Flow ↓
30. Single Cycle, Multiple Cycle, vs. Pipeline
Single-cycle implementation (Cycle 1 = Load, Cycle 2 = Store):
  each cycle is stretched to fit the slowest instruction, so part of the Store cycle is wasted
Multiple-cycle implementation (Cycles 1-10):
  Load: Ifetch Reg Exec Mem Wr | Store: Ifetch Reg Exec Mem | R-type: Ifetch …
Pipeline implementation:
  Load    Ifetch Reg Exec Mem Wr
  Store          Ifetch Reg Exec Mem Wr
  R-type                Ifetch Reg Exec Mem Wr
31. The Five Stages of Load
Cycle 1 … Cycle 5: Load = Ifetch | Reg/Dec | Exec | Mem | Wr
• Ifetch: Instruction Fetch
  – Fetch the instruction from the Instruction Memory
• Reg/Dec: Register fetch and instruction decode
• Exec: Calculate the memory address
• Mem: Read the data from the Data Memory
• Wr: Write the data back to the register file
32. The Four Stages of R-type
Cycle 1 … Cycle 4: R-type = Ifetch | Reg/Dec | Exec | Wr
• Ifetch: Instruction Fetch
  – Fetch the instruction from the Instruction Memory
• Reg/Dec: Register fetch and instruction decode
• Exec:
  – ALU operates on the two register operands
  – Update PC
• Wr: Write the ALU output back to the register file
33. Pipelining the R-type and Load Instructions
Cycle 1 … Cycle 9:
  R-type  Ifetch Reg/Dec Exec Wr
  R-type         Ifetch Reg/Dec Exec Wr
  Load                  Ifetch Reg/Dec Exec Mem Wr
  R-type                       Ifetch Reg/Dec Exec Wr   ← Oops! We have a problem!
  R-type                              Ifetch Reg/Dec Exec Wr
• We have a pipeline conflict, or structural hazard:
  – Two instructions try to write to the register file at the same time!
  – There is only one write port
34. Important Observation
• Each functional unit can only be used once per instruction
• Each functional unit must be used at the same stage for all instructions:
  – Load uses the Register File's write port during its 5th stage
      1 2 3 4 5: Ifetch Reg/Dec Exec Mem Wr
  – R-type uses the Register File's write port during its 4th stage
      1 2 3 4: Ifetch Reg/Dec Exec Wr
• There are 2 ways to solve this pipeline hazard.
35. Solution 1: Insert "Bubble" into the Pipeline
[Timing diagram, Cycles 1-9: after the Load, a bubble delays the following R-type instructions by one cycle, so only one write reaches the register file per cycle]
• Insert a "bubble" into the pipeline to prevent 2 writes in the same cycle
  – The control logic can be complex.
  – We lose an instruction fetch and issue opportunity.
• No instruction is started in Cycle 6!
36. Solution 2: Delay R-type's Write by One Cycle
• Delay the R-type's register write by one cycle:
  – Now R-type instructions also use the Reg File's write port at Stage 5
  – The Mem stage is a NOOP stage: nothing is being done
      1 2 3 4 5: R-type = Ifetch Reg/Dec Exec Mem Wr
Cycle 1 … Cycle 9:
  R-type  Ifetch Reg/Dec Exec Mem Wr
  R-type         Ifetch Reg/Dec Exec Mem Wr
  Load                  Ifetch Reg/Dec Exec Mem Wr
  R-type                       Ifetch Reg/Dec Exec Mem Wr
  R-type                              Ifetch Reg/Dec Exec Mem Wr
37. Why Pipeline?
• Suppose we execute 100 instructions
• Single-cycle machine
  – 45 ns/cycle × 1 CPI × 100 inst = 4500 ns
• Multicycle machine
  – 10 ns/cycle × 4.6 CPI (due to inst mix) × 100 inst = 4600 ns
• Ideal pipelined machine
  – 10 ns/cycle × (1 CPI × 100 inst + 4 cycle drain) = 1040 ns
38. Why Pipeline? Because the resources are there!
Time (clock cycles)
[Figure: five instructions (Inst 0-4) in program order, each using Im (instruction memory), Reg, ALU, Dm (data memory), and Reg in successive cycles; once the pipe fills, every resource is busy every cycle]
39. Problems with Pipeline
processors?
• Limits to pipelining: Hazards prevent next instruction from
executing during its designated clock cycle and introduce
stall cycles which increase CPI
– Structural hazards: HW cannot support this combination of
instructions - two dogs fighting for the same bone
– Data hazards: Instruction depends on result of prior
instruction still in the pipeline
» Data dependencies
– Control hazards: Caused by delay between the fetching of
instructions and decisions about changes in control flow
(branches and jumps).
» Control dependencies
• Can always resolve hazards by stalling
• More stall cycles = more CPU time = less performance
– Increase performance = decrease stall cycles
40. Back to our old friend: the CPU time equation
• Recall the equation for CPU time:
  CPU time = Instruction Count × CPI × Clock cycle time
• So what are we doing by pipelining the instruction execution process?
  – Clock?
  – Instruction count?
  – CPI?
    » How is CPI affected by the various hazards?
42. One Memory Port / Structural Hazards
Time (clock cycles): Cycle 1 … Cycle 7
[Figure: a Load followed by Instr 1-4 in program order, each flowing through Ifetch, Reg, ALU, DMem, Reg; with a single memory port, Instr 3's Ifetch in Cycle 4 needs the memory at the same time as the Load's DMem access]
43. One Memory Port / Structural Hazards
Time (clock cycles): Cycle 1 … Cycle 7
[Figure: the same sequence, but Instr 3 is stalled one cycle, with a bubble filling every stage slot it would have occupied, so its Ifetch no longer collides with the Load's DMem access]
44. Example: Dual-port vs. Single-port
• Machine A: Dual ported memory (“Harvard
Architecture”)
• Machine B: Single ported memory, but its
pipelined implementation has a 1.05 times faster
clock rate
• Ideal CPI = 1 for both
• Note - Loads will cause stalls of 1 cycle
• Recall our friend:
– CPU = IC*CPI*Clk
» CPI= ideal CPI + stalls
45. Example…
• Machine A: Dual-ported memory ("Harvard Architecture")
• Machine B: Single-ported memory, but its pipelined implementation has a 1.05 times faster clock rate
• Ideal CPI = 1 for both
• Loads are 40% of instructions executed
SpeedUpA = Pipe. Depth / (1 + 0) × (clock_unpipe / clock_pipe)
         = Pipeline Depth
SpeedUpB = Pipe. Depth / (1 + 0.4 × 1) × (clock_unpipe / (clock_unpipe / 1.05))
         = (Pipe. Depth / 1.4) × 1.05
         = 0.75 × Pipe. Depth
SpeedUpA / SpeedUpB = Pipe. Depth / (0.75 × Pipe. Depth) = 1.33
• Machine A is 1.33 times faster
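A quick Python sketch of this calculation; the pipeline depth is assumed to be 5 here, but it cancels out of the final ratio:

```python
# Sketch: plug the slide's assumptions into the speedup formulas.

depth = 5                    # assumed pipeline depth (cancels in the A/B ratio)
load_frac = 0.40             # loads are 40% of instructions executed
speedup_A = depth / (1 + 0.0)                       # dual-ported: no load stalls
speedup_B = (depth / (1 + load_frac * 1)) * 1.05    # 1-cycle load stall, 5% faster clock
print(speedup_A / speedup_B)   # ~1.33: machine A is 1.33x faster
```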
46. Data Dependencies
• True dependencies and False dependencies
– false implies we can remove the dependency
– true implies we are stuck with it!
• Three types of data dependencies defined in
terms of how succeeding instruction depends
on preceding instruction
– RAW: Read after Write or Flow dependency
– WAR: Write after Read or anti-dependency
– WAW: Write after Write
47. Three Generic Data Hazards
• Read After Write (RAW)
InstrJ tries to read operand before InstrI writes it
I: add r1,r2,r3
J: sub r4,r1,r3
• Caused by a “Dependence” (in compiler
nomenclature). This hazard results from an actual
need for communication.
48. RAW Dependency
• Example program (a) with two instructions
  – i1: load r1, a;
  – i2: add r2, r1, r1;
• Program (b) with two instructions
  – i1: mul r1, r4, r5;
  – i2: add r2, r1, r1;
• In both cases we cannot read in i2 until i1 has completed writing the result
  – In (a) this is due to a load-use dependency
  – In (b) this is due to a define-use dependency
49. Three Generic Data Hazards
• Write After Read (WAR)
InstrJ writes operand before InstrI reads it
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
• Called an “anti-dependence” by compiler writers.
This results from reuse of the name “r1”.
• Can’t happen in MIPS 5 stage pipeline because:
– All instructions take 5 stages, and
– Reads are always in stage 2, and
– Writes are always in stage 5
50. Three Generic Data Hazards
• Write After Write (WAW)
InstrJ writes operand before InstrI writes it.
• Called an “output dependence” by compiler writers
This also results from the reuse of name “r1”.
• Can’t happen in MIPS 5 stage pipeline because:
– All instructions take 5 stages, and
– Writes are always in stage 5
• Will see WAR and WAW in later more complicated
pipes
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
51. WAR and WAW Dependency
• Example program (a):
  – i1: mul r1, r2, r3;
  – i2: add r2, r4, r5;
• Example program (b):
  – i1: mul r1, r2, r3;
  – i2: add r1, r4, r5;
• In both cases we have a dependence between i1 and i2
  – in (a), because r2 must be read by i1 before it is written by i2
  – in (b), because r1 must be written by i2 after it has been written by i1
52. What to do with WAR and WAW ?
• Problem:
– i1: mul r1, r2, r3;
– i2: add r2, r4, r5;
• Is this really a dependence/hazard ?
53. What to do with WAR and WAW
• Solution: Rename Registers
– i1: mul r1, r2, r3;
– i2: add r6, r4, r5;
• Register renaming can solve many of these
false dependencies
– note the role that the compiler plays in this
– specifically, the register allocation process--i.e., the
process that assigns registers to variables
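A minimal Python sketch of the renaming idea, assuming an unbounded supply of fresh registers; the physical names p1, p2, … are hypothetical, and instructions are (dest, src1, src2) tuples:

```python
# Sketch: remove WAR/WAW (false) dependencies by giving every write a
# fresh register, in the spirit of the compiler's register allocator.

def rename(instrs, free_regs):
    current = {}                     # architectural name -> latest physical name
    out = []
    for dest, s1, s2 in instrs:
        s1 = current.get(s1, s1)     # sources read the latest name (keeps RAW)
        s2 = current.get(s2, s2)
        new = free_regs.pop(0)       # every write gets a fresh register,
        current[dest] = new          # so WAR and WAW cannot occur
        out.append((new, s1, s2))
    return out

prog = [("r1", "r2", "r3"),   # i1: mul r1, r2, r3
        ("r2", "r4", "r5")]   # i2: add r2, r4, r5  (WAR on r2 against i1)
print(rename(prog, ["p1", "p2", "p3"]))
# [('p1', 'r2', 'r3'), ('p2', 'r4', 'r5')] -- i2 no longer touches i1's inputs
```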
54. Hazard Detection in H/W
• Suppose instruction i is about to be issued and a
predecessor instruction j is in the instruction pipeline
• How to detect and store potential hazard information
– Note that hazards in machine code are based on register
usage
– Keep track of results in registers and their usage
» Constructing a register data flow graph
• For each instruction i construct set of Read registers
and Write registers
– Rregs(i) is set of registers that instruction i reads from
– Wregs(i) is set of registers that instruction i writes to
– Use these to define the 3 types of data hazards
55. Hazard Detection in Hardware
• A RAW hazard exists on register ρ if ρ ∈ Rregs( i ) ∩ Wregs( j )
  – Keep a record of pending writes (for instructions in the pipe) and compare with the operand registers of the current instruction.
  – When an instruction issues, reserve its result register.
  – When an operation completes, remove its write reservation.
• A WAW hazard exists on register ρ if ρ ∈ Wregs( i ) ∩ Wregs( j )
• A WAR hazard exists on register ρ if ρ ∈ Wregs( i ) ∩ Rregs( j )
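These set intersections translate directly into code. A minimal Python sketch, classifying the hazard between a predecessor j already in the pipe and an instruction i about to issue:

```python
# Sketch: hazard classification from read/write register sets.

def hazards(rregs_i, wregs_i, rregs_j, wregs_j):
    return {
        "RAW": rregs_i & wregs_j,   # i reads what j will write
        "WAW": wregs_i & wregs_j,   # i writes what j also writes
        "WAR": wregs_i & rregs_j,   # i writes what j still has to read
    }

# j: load r1, a      (writes r1)
# i: add r2, r1, r1  (reads r1, writes r2)
print(hazards({"r1"}, {"r2"}, set(), {"r1"}))
# {'RAW': {'r1'}, 'WAW': set(), 'WAR': set()}
```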
56. Internal Forwarding: Getting rid of
some hazards
• In some cases the data needed by the next
instruction at the ALU stage has been
computed by the ALU (or some stage defining
it) but has not been written back to the
registers
• Can we “forward” this result by bypassing
stages ?
57. Data Hazard on R1
Time (clock cycles), stages IF ID/RF EX MEM WB
[Figure: the sequence add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11 in program order; the instructions after the add need r1 before the add's write-back completes]
58. Forwarding to Avoid Data Hazard
Time (clock cycles)
[Figure: the same sequence, with forwarding paths from the add's ALU output and later pipeline latches feeding the ALU inputs of sub, and, and or, so no stalls are needed]
59. Internal Forwarding of
Instructions
• Forward result from ALU/Execute unit to
execute unit in next stage
• Also can be used in cases of memory access
• in some cases, operand fetched from memory has
been computed previously by the program
– can we “forward” this result to a later stage thus
avoiding an extra read from memory ?
– Who does this ?
• Internal forwarding cases
– Stage i to Stage i+k in pipeline
– store-load forwarding
– load-store forwarding
– store-store forwarding
60. Internal Data Forwarding
Store-load forwarding: [figure] the memory-access pair
  STO M,R1 ; LD R2,M
becomes
  STO M,R1 ; MOVE R2,R1
so the load is replaced by a register move, avoiding one memory read.
61. Internal Data Forwarding
Load-load forwarding: [figure] the pair
  LD R1,M ; LD R2,M
becomes
  LD R1,M ; MOVE R2,R1
so the second load is replaced by a register move.
62. Internal Data Forwarding
Store-store forwarding: [figure] the pair
  STO M,R1 ; STO M,R2
becomes
  STO M,R2
since the first store is immediately overwritten, it can be eliminated.
63. HW Change for Forwarding
[Figure: the EX stage with muxes in front of both ALU inputs, fed by the ID/EX register, the EX/MEM and MEM/WR latches, and the immediate; NextPC, the register file, and data memory complete the path]
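A sketch of the select logic those new muxes imply, written in Python; the latch field names (regwrite, rd) are our assumption, and the conditions follow the standard textbook MIPS forwarding rules rather than anything shown on this slide:

```python
# Sketch: pick each ALU input from the register file, the EX/MEM latch,
# or the MEM/WB latch, preferring the most recent pending result.

def forward_select(src, ex_mem, mem_wb):
    # src: register number needed by the ALU in EX
    if ex_mem["regwrite"] and ex_mem["rd"] != 0 and ex_mem["rd"] == src:
        return "EX/MEM"      # result computed last cycle, not yet written back
    if mem_wb["regwrite"] and mem_wb["rd"] != 0 and mem_wb["rd"] == src:
        return "MEM/WB"      # result from two cycles ago
    return "REGFILE"         # no pending write: read the register file

ex_mem = {"regwrite": True, "rd": 1}    # e.g. add r1,r2,r3 now in MEM
mem_wb = {"regwrite": False, "rd": 0}
print(forward_select(1, ex_mem, mem_wb))   # EX/MEM: forward the ALU result
```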
64. What about memory operations?
º If instructions are initiated in order and memory operations always occur in the same stage, there can be no hazards between memory operations!
º What does delaying WB on arithmetic operations cost?
  – cycles?
  – hardware?
º What about data dependence on loads?
    R1 <- R4 + R5
    R2 <- Mem[ R2 + I ]
    R3 <- R2 + R1
  ⇒ "Delayed Loads"
º Can recognize this in the decode stage and introduce a bubble while stalling the fetch stage
º Tricky situation:
    R1 <- Mem[ R2 + I ]
    Mem[ R3 + 34 ] <- R1
  Handle with a bypass in the memory stage!
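The decode-stage check mentioned above ("recognize this in the decode stage") could look like this Python sketch; the pipeline-latch field names (memread, rs, rt) are assumed, following the usual load-use interlock condition:

```python
# Sketch: insert one bubble when the instruction in EX is a load whose
# destination is a source of the instruction being decoded.

def must_stall(id_ex, if_id):
    # id_ex: the load currently in EX; if_id: the instruction in decode
    return id_ex["memread"] and id_ex["rt"] in (if_id["rs"], if_id["rt"])

id_ex = {"memread": True, "rt": 2}   # R2 <- Mem[ R2 + I ]
if_id = {"rs": 2, "rt": 1}           # R3 <- R2 + R1
print(must_stall(id_ex, if_id))      # True: insert a bubble, hold fetch
```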
65. Data Hazard Even with Forwarding
Time (clock cycles)
[Figure: lw r1, 0(r2) followed by sub r4,r1,r6; and r6,r1,r7; or r8,r1,r9; the loaded value is not available until the end of MEM, which is too late for the sub's EX stage even with forwarding]
66. Data Hazard Even with Forwarding
Time (clock cycles)
[Figure: the same sequence with a one-cycle bubble inserted; the sub stalls in decode, and the load's MEM output is then forwarded to the sub's EX stage]
68. Software Scheduling to Avoid Load Hazards
Try producing fast code for
  a = b + c;
  d = e - f;
assuming a, b, c, d, e, and f are in memory.

Slow code:
  LW  Rb,b
  LW  Rc,c
  ADD Ra,Rb,Rc
  SW  a,Ra
  LW  Re,e
  LW  Rf,f
  SUB Rd,Re,Rf
  SW  d,Rd

Fast code:
  LW  Rb,b
  LW  Rc,c
  LW  Re,e
  ADD Ra,Rb,Rc
  LW  Rf,f
  SW  a,Ra
  SUB Rd,Re,Rf
  SW  d,Rd
69. Control Hazards: Branches
• Instruction flow
– Stream of instructions processed by Inst. Fetch unit
– Speed of “input flow” puts bound on rate of
outputs generated
• Branch instruction affects instruction flow
– Do not know next instruction to be executed until
branch outcome known
• When we hit a branch instruction
– Need to compute target address (where to branch)
– Resolution of branch condition (true or false)
– Might need to ‘flush’ pipeline if other instructions
have been fetched for execution
70. Control Hazard on Branches: Three-Stage Stall
[Figure: 10: beq r1,r3,36 followed by 14: and r2,r3,r5; 18: or r6,r1,r7; 22: add r8,r1,r9; once the branch resolves, fetch continues at 36: xor r10,r1,r11, so three issue slots are lost]
71. Branch Stall Impact
• If CPI = 1 and 30% of instructions are branches, a 3-cycle stall gives a new CPI = 1 + 0.3 × 3 = 1.9!
• Two-part solution:
  – Determine branch taken or not sooner, AND
  – Compute the taken-branch address earlier
• MIPS branch tests if register = 0 or ≠ 0
• MIPS solution:
  – Move the zero test to the ID/RF stage
  – Add an adder to calculate the new PC in the ID/RF stage
  – 1 clock cycle penalty for branch versus 3
72. Pipelined MIPS (DLX) Datapath
[Pipelined datapath figure: Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc., Memory Access, Write Back]
This is the correct 1-cycle-latency implementation!
73. Four Branch Hazard Alternatives
#1: Stall until branch direction is clear (flush the pipe)
#2: Predict Branch Not Taken
  – Execute successor instructions in sequence
  – "Squash" instructions in the pipeline if the branch is actually taken
  – Advantage of late pipeline state update
  – 47% of DLX branches are not taken on average
  – PC+4 already calculated, so use it to get the next instruction
#3: Predict Branch Taken
  – 53% of DLX branches are taken on average
  – But the branch target address has not yet been calculated in DLX
    » DLX still incurs a 1-cycle branch penalty
    » Other machines: branch target known before outcome
74. Four Branch Hazard Alternatives
#4: Delayed Branch
  – Define the branch to take place AFTER the following instructions:
      branch instruction
      sequential successor_1
      sequential successor_2
      ........
      sequential successor_n    ← branch delay of length n
      branch target if taken
  – A 1-slot delay allows a proper decision and branch target address in a 5-stage pipeline
  – DLX uses this
76. Delayed Branch
• Where do we get instructions to fill the branch delay slot?
  – From before the branch instruction
  – From the target address: only valuable when the branch is taken
  – From fall-through: only valuable when the branch is not taken
  – Cancelling branches allow more slots to be filled
• Compiler effectiveness for a single branch delay slot:
  – Fills about 60% of branch delay slots
  – About 80% of instructions executed in branch delay slots are useful in computation
  – About 50% (60% x 80%) of slots usefully filled
• Delayed-branch downside: works poorly with 7-8 stage pipelines and multiple instructions issued per clock (superscalar)
77. Evaluating Branch Alternatives
Pipeline speedup = Pipeline depth / (1 + Branch frequency × Branch penalty)

Scheduling scheme    Branch penalty   CPI    Speedup v. unpipelined   Speedup v. stall
Stall pipeline            3           1.42           3.5                   1.0
Predict taken             1           1.14           4.4                   1.26
Predict not taken         1           1.09           4.5                   1.29
Delayed branch            0.5         1.07           4.6                   1.31

Assumes conditional & unconditional branches = 14% of instructions, 65% of which change the PC.
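A Python sketch that recomputes the table's columns from the stated assumptions; the pipeline depth of 5 is assumed, and small differences from the slide's numbers come from rounding:

```python
# Sketch: CPI and speedup for each branch-handling scheme.
# Effective penalty = penalty cycles, scaled by taken fraction where relevant.

depth, branch_freq = 5, 0.14
schemes = {
    "stall":             3,
    "predict taken":     1,
    "predict not taken": 0.65 * 1,   # penalty only for the 65% taken branches
    "delayed branch":    0.5,
}
for name, penalty in schemes.items():
    cpi = 1 + branch_freq * penalty
    print(f"{name:18s} CPI={cpi:.2f} "
          f"vs-unpipelined={depth / cpi:.1f} "
          f"vs-stall={(1 + branch_freq * 3) / cpi:.2f}")
```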
78. Branch Prediction Based on History
• Can we use the history of branch behaviour to predict the branch outcome?
• Simplest scheme: use 1 bit of "history"
  – Set bit to Predict Taken (T) or Predict Not-taken (NT)
  – Pipeline checks the bit value and predicts
    » If incorrect then need to invalidate the instruction
  – Actual outcome used to set the bit value
• Example: let the initial value = T; the actual outcome of the branches is NT, T, T, NT, T, T
  – Predictions are: T, NT, T, T, NT, T
    » 4 wrong, 2 correct = 33% accuracy
• In general, can have k-bit predictors: more when we cover superscalar processors.
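A Python sketch of the 1-bit predictor running the slide's example (it also confirms the 2-correct, 4-wrong count):

```python
# Sketch: 1-bit predictor; predict with the bit, then update the bit
# with the actual outcome.

def run_1bit(outcomes, initial="T"):
    bit, predictions, correct = initial, [], 0
    for actual in outcomes:
        predictions.append(bit)
        correct += (bit == actual)
        bit = actual              # 1 bit of history: remember the last outcome
    return predictions, correct

preds, correct = run_1bit(["NT", "T", "T", "NT", "T", "T"])
print(preds)     # ['T', 'NT', 'T', 'T', 'NT', 'T']
print(correct)   # 2 correct, 4 wrong
```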
79. Summary: Control and Pipelining
• Just overlap tasks; easy if tasks are independent
• Speedup ≤ Pipeline Depth; if ideal CPI is 1, then:
  Speedup = [Pipeline depth / (1 + Pipeline stall CPI)] × (Cycle Time_unpipelined / Cycle Time_pipelined)
• Hazards limit performance on computers:
  – Structural: need more HW resources
  – Data (RAW, WAR, WAW): need forwarding, compiler scheduling
  – Control: delayed branch, prediction
80. Summary #1/2: What Makes Pipelining Easy, What Makes It Hard?
• What makes it easy?
  – all instructions are the same length
  – just a few instruction formats
  – memory operands appear only in loads and stores
• What makes it hard? HAZARDS!
  – structural hazards: suppose we had only one memory
  – control hazards: need to worry about branch instructions
  – data hazards: an instruction depends on a previous instruction
• Pipelines pass control information down the pipe just as data moves down the pipe
• Forwarding/stalls handled by local control
• Exceptions stop the pipeline
81. Introduction to ILP
• What is ILP?
– Processor and Compiler design techniques that
speed up execution by causing individual machine
operations to execute in parallel
• ILP is transparent to the user
– Multiple operations executed in parallel even
though the system is handed a single program
written with a sequential processor in mind
• Same execution hardware as a normal RISC
machine
– May be more than one of any given type of
hardware
82. Compiler vs. Processor
[Figure: division of labor between compiler and hardware. The compiler always handles the frontend and optimizer; the hardware always executes. In between, the tasks "determine dependences", "determine independences", "bind operations to function units", and "bind transports to busses" shift from hardware to compiler as one moves from Superscalar to Dataflow to Independence architectures to VLIW to TTA]
B. Ramakrishna Rau and Joseph A. Fisher. Instruction-level parallelism: History, overview, and perspective. The Journal of Supercomputing, 7(1-2):9-50, May 1993.
Editor's Notes
Here are the timing diagrams showing the differences between the single cycle, multiple cycle, and pipeline implementations. For example, in the pipeline implementation, we can finish executing the Load, Store, and R-type instruction sequence in seven cycles. In the multiple clock cycle implementation, however, we cannot start executing the store until Cycle 6 because we must wait for the load instruction to complete. Similarly, we cannot start the execution of the R-type instruction until the store instruction has completed its execution in Cycle 9. In the Single Cycle implementation, the cycle time is set to accommodate the longest instruction, the Load instruction. Consequently, the cycle time for the Single Cycle implementation can be five times longer than that of the multiple cycle implementation. But maybe more importantly, since the cycle time has to be long enough for the load instruction, it is too long for the store instruction, so the last part of the cycle here is wasted.
As shown here, each of these five steps will take one clock cycle to complete. And in pipeline terminology, each step is referred to as one stage of the pipeline.
Let’s take a look at the R-type instructions. The R-type instruction does NOT access data memory, so it only takes four clock cycles, or in our new pipeline terminology, four stages to complete. The Ifetch and Reg/Dec stages are identical to the Load instruction's. Well, they have to be, because at this point we do not know we have an R-type instruction yet. Instead of calculating the effective address during the Exec stage, the R-type instruction uses the ALU to operate on the register operands. The result of this ALU operation is written back to the register file during the Wr stage.
What happens if we try to pipeline the R-type instructions with the Load instructions? Well, we have a problem here!!! We end up having two instructions trying to write to the register file at the same time! Why do we have this problem (the write "bubble")?
The first solution is to insert a "bubble" into the pipeline AFTER the load instruction to push back every instruction after the load that is already in the pipeline by one cycle. At the same time, the bubble delays the Instruction Fetch of the instruction that is about to enter the pipeline by one cycle. Needless to say, the control logic to accomplish this can be complex. Furthermore, this solution also has a negative impact on performance. Notice that due to the "extra" stage (Mem) the Load instruction has, we will not have one instruction finishing every cycle (see Cycle 5). Consequently, a mix of load and R-type instructions will NOT have an average CPI of 1 because, in effect, the Load instruction has an effective CPI of 2. So this is not that hot an idea. Let's try something else.
Well, one thing we can do is add a "Nop" stage to the R-type instruction pipeline to delay its register file write by one cycle. Now the R-type instruction ALSO uses the register file's write port at its 5th stage, so we eliminate the write conflict with the load instruction. This is a much simpler solution as far as the control logic is concerned. As far as performance is concerned, we also get back to having one instruction complete per cycle. This is kind of like promoting socialism: by making each individual R-type instruction take 5 cycles instead of 4 cycles to finish, our overall performance is actually better off. The reason for this higher performance is that we end up with a more efficient pipeline.