Advanced Linux Programming
CS1403
Pipelined Data-Path
Laundry analogy: assume 30 min. per task – wash, dry, fold, store – and that
separate tasks use separate hardware and so can be overlapped
[Figure: task order vs. time (6 PM – 2 AM) – laundry loads A, B, …, D run pipelined, their wash/dry/fold/store tasks overlapped on the separate machines.]
Pipelined vs. Single-Cycle
[Figure: program execution order vs. time for lw $1, 100($0); lw $2, 200($0); lw $3, 300($0).
Single-cycle: each instruction runs instruction fetch, Reg, ALU, data access, Reg and takes 8 ns before the next one starts.
Pipelined: a new instruction starts every 2 ns, with each of the five stages taking 2 ns.]
Pipelining: Keep in Mind
• Pipelining does not reduce the latency of a single task;
it increases the throughput of the entire workload
• Pipeline rate is limited by the longest stage
– potential speedup = number of pipe stages (see the sketch below)
– unbalanced pipe-stage lengths reduce speedup
• Time to fill the pipeline and time to drain it – when
there is slack in the pipeline – also reduce speedup
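A minimal sketch of these effects in C (mine, not from the slides), assuming a hypothetical workload of 1000 instructions and made-up stage latencies; change stage_ns to see how an unbalanced stage or a short run pulls the speedup below the ideal bound:

    #include <stdio.h>

    /* Hypothetical stage latencies in ns (illustrative values only). */
    static const double stage_ns[5] = {2.0, 2.0, 2.0, 2.0, 2.0};

    int main(void) {
        const int  n_stages = 5;
        const long n_instr  = 1000;        /* hypothetical workload size */

        double path_ns = 0.0;              /* unpipelined: sum of all stages  */
        double longest = 0.0;              /* pipelined clock = slowest stage */
        for (int i = 0; i < n_stages; i++) {
            path_ns += stage_ns[i];
            if (stage_ns[i] > longest) longest = stage_ns[i];
        }

        /* Unpipelined: every instruction walks the full path by itself. */
        double t_single = n_instr * path_ns;

        /* Pipelined: (n_stages - 1) cycles to fill the pipe, then one
         * instruction completes per cycle; the clock is the longest stage. */
        double t_piped = (n_stages - 1 + n_instr) * longest;

        printf("unpipelined %.0f ns, pipelined %.0f ns, speedup %.2fx\n",
               t_single, t_piped, t_single / t_piped);
        return 0;
    }

With five balanced 2 ns stages and a long run the speedup approaches 5x; lengthening any one stage, or shortening the run, drops it below that bound.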
Pipelining MIPS
• What makes it easy with MIPS?
– all instructions are the same length
• so the fetch and decode stages are similar for all instructions
– just a few instruction formats
• simplifies instruction decode and makes it possible in one stage (see the decode sketch after this list)
– memory operands appear only in loads and stores
• so memory access can be deferred to exactly one later stage
– operands are aligned in memory
• so one data transfer instruction requires exactly one memory-access stage
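A small sketch (mine, not from the slides) of why fixed-length, few-format instructions make decode cheap: every field sits at a fixed bit position in the 32-bit word, so one stage of shifts and masks recovers them all.

    #include <stdint.h>
    #include <stdio.h>

    /* Field layout shared by the MIPS R- and I-formats. */
    typedef struct {
        uint32_t op, rs, rt, rd, shamt, funct;
        int32_t  imm;                  /* sign-extended 16-bit immediate */
    } mips_fields;

    static mips_fields decode(uint32_t instr) {
        mips_fields f;
        f.op    = (instr >> 26) & 0x3f;
        f.rs    = (instr >> 21) & 0x1f;
        f.rt    = (instr >> 16) & 0x1f;
        f.rd    = (instr >> 11) & 0x1f;
        f.shamt = (instr >>  6) & 0x1f;
        f.funct =  instr        & 0x3f;
        f.imm   = (int32_t)(int16_t)(instr & 0xffff);   /* sign extend */
        return f;
    }

    int main(void) {
        uint32_t lw = 0x8d280020u;     /* lw $t0, 32($t1) */
        mips_fields f = decode(lw);
        printf("op=%u rs=%u rt=%u imm=%d\n", f.op, f.rs, f.rt, f.imm);
        return 0;
    }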
Pipelining MIPS
• What makes it hard?
– structural hazards: different instructions, at different stages of the
pipeline, want to use the same hardware resource (see the figure and sketch below)
– control hazards: the next instruction to put into the pipeline depends
on the outcome of a branch instruction that is still in the pipeline
– data hazards: an instruction in the pipeline needs data that is still
being computed by an earlier instruction in the pipeline
[Figure: pipelined lw $3, 300($0) and lw $4, 400($0), 2 ns per stage, annotated "Hazard if single memory" – with one unified memory, a later instruction's instruction fetch falls in the same cycle as an earlier instruction's data access.]
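A toy sketch (mine) of the single-memory structural hazard: assuming one instruction issued per cycle and no stalls, a load/store reaches its memory-access stage (stage 4 of 5) in the same cycle that the instruction three slots behind it is being fetched, so with one unified memory the two accesses collide.

    #include <stdbool.h>
    #include <stdio.h>

    int main(void) {
        /* A short all-load instruction stream, like the lw sequence above. */
        bool is_mem_op[6] = { true, true, true, true, true, true };
        int  n = 6;

        for (int i = 0; i < n; i++) {
            if (!is_mem_op[i]) continue;
            int j = i + 3;              /* instruction fetched during i's MEM */
            if (j < n)
                printf("cycle %d: instr %d data access vs instr %d fetch\n",
                       i + 3, i, j);
        }
        return 0;
    }

Splitting instruction and data memories (or caches) removes this conflict, which is the usual fix in the 5-stage MIPS pipeline.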
[Figure: program execution order vs. time – add $4, $5, $6; beq $1, $2, 40; lw $3, 300($0). The lw behind the branch is held back by a bubble (pipeline stall), starting 4 ns rather than 2 ns after the beq, until the branch outcome is known. Note that the branch outcome is computed in the ID stage with added hardware (later…).]
Control Hazards
• Solution 2: Predict branch outcome
– e.g., predict branch-not-taken:
[Figure: prediction success – add $4, $5, $6; beq $1, $2, 40; lw $3, 300($0) each start 2 ns apart; the not-taken prediction is correct, so the lw flows through with no stall.]
[Figure: prediction failure – add $4, $5, $6; beq $1, $2, 40; the branch is taken, so the wrongly fetched lw is undone (flushed) into bubbles and the target instruction or $7, $8, $9 starts 4 ns after the beq.]
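A rough cost model in C for predict-not-taken (the branch frequency and taken rate below are made-up numbers, not from the slides): every taken branch flushes the wrongly fetched instruction, adding one bubble when the outcome is known in ID.

    #include <stdio.h>

    int main(void) {
        /* Hypothetical workload parameters (assumptions, not course data). */
        double branch_frac   = 0.17;   /* fraction of instructions that branch */
        double taken_frac    = 0.40;   /* fraction of branches actually taken  */
        int    flush_penalty = 1;      /* bubbles per misprediction (ID branch)*/

        double cpi_predict = 1.0 + branch_frac * taken_frac * flush_penalty;
        double cpi_stall   = 1.0 + branch_frac * flush_penalty;

        printf("CPI, predict-not-taken: %.3f\n", cpi_predict);
        printf("CPI, always stall:      %.3f\n", cpi_stall);
        return 0;
    }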
Control Hazards
• Solution 3: Delayed branch – always execute the sequentially next
instruction, with the branch taking effect after a one-instruction delay.
It is the compiler's job to fill the delay slot with an instruction that is
independent of the branch outcome, e.g. one from before the branch that
does not feed the branch condition.
• MIPS does this
[Figure: instruction pipeline diagram for add $s0, $t0, $t1 – the five stages IF, ID, EX, MEM, WB laid out along the time axis; shading indicates when a unit is used, left half = write, right half = read.]
[Figure: program execution order vs. time – add $s0, $t0, $t1 followed by sub $t2, $s0, $t3, each drawn as IF ID EX MEM WB. Without forwarding (blue line) the sub would need $s0 before the add has written it – the data would have to go back in time; with forwarding (red line) the ALU result is passed directly to the sub, so the data is available in time.]
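A sketch of the EX-stage forwarding test for this add/sub pair (field names such as reg_write and rd are my own; the course's datapath labels may differ):

    #include <stdbool.h>
    #include <stdio.h>

    /* State carried by an older instruction further down the pipeline. */
    typedef struct {
        bool reg_write;    /* will it write a register?   */
        int  rd;           /* destination register number */
    } older_instr;

    /* Forward the older result straight into the ALU input if the older
     * instruction writes the register the current instruction reads.   */
    static bool forward_needed(older_instr older, int src_reg) {
        return older.reg_write && older.rd != 0 && older.rd == src_reg;
    }

    int main(void) {
        /* add $s0, $t0, $t1 ($s0 = reg 16) in EX/MEM; sub reads $s0. */
        older_instr add_in_ex_mem = { .reg_write = true, .rd = 16 };
        int sub_rs = 16;

        printf("forward EX/MEM result to sub? %s\n",
               forward_needed(add_in_ex_mem, sub_rs) ? "yes" : "no");
        return 0;
    }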
• Reordered code (load/store order interchanged relative to the original):
lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t0, 4($t1)
sw $t2, 0($t1)
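For context, a C-level sketch of what this lw/sw sequence computes – a swap of the two words at 0($t1) and 4($t1) (variable names are mine):

    #include <stdio.h>

    /* Load both words first, then store them back interchanged. */
    static void swap_words(int *base) {
        int t0 = base[0];    /* lw $t0, 0($t1) */
        int t2 = base[1];    /* lw $t2, 4($t1) */
        base[1] = t0;        /* sw $t0, 4($t1) */
        base[0] = t2;        /* sw $t2, 0($t1) */
    }

    int main(void) {
        int a[2] = {10, 20};
        swap_words(a);
        printf("%d %d\n", a[0], a[1]);   /* prints 20 10 */
        return 0;
    }

Issuing both loads before either store separates each lw from the sw that needs its value, so neither store has to wait on a just-loaded result.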
Pipelined Datapath
• We now move to actually building a pipelined datapath
• First recall the 5 steps in instruction execution
1. Instruction Fetch & PC Increment (IF)
2. Instruction Decode and Register Read (ID)
3. Execution or calculate address (EX)
4. Memory access (MEM)
5. Write result into register (WB)
• Review: single-cycle processor
– all 5 steps done in a single clock cycle
– dedicated hardware required for each step
• What happens if we break the execution into multiple cycles, but keep
the extra hardware?
Review - Single-Cycle Data-path “Steps”
[Figure: the single-cycle datapath – PC, adders and shift-left-2 for PC+4 and the branch target, instruction memory (IF); register file read and sign extend (ID); ALU with Zero output and operand mux (EX – Execute/Address Calc.); data memory (MEM); write-back mux into the register file (WB).]
Pipelined Datapath – Key Idea
• What happens if we break the execution into multiple cycles,
but keep the extra hardware?
– Answer: We may be able to start executing a new instruction at each
clock cycle - pipelining
• …but we shall need extra registers to hold data between
cycles – pipeline registers
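A rough sketch, in C, of what the four pipeline registers might carry between stages (field names and types are illustrative; the exact widths appear on the next slide):

    #include <stdint.h>
    #include <stdio.h>

    typedef struct {               /* IF/ID: fetched instruction and PC+4 */
        uint32_t instr, pc_plus4;
    } if_id_reg;

    typedef struct {               /* ID/EX: register values, immediate, PC+4 */
        uint32_t pc_plus4, read_data1, read_data2;
        int32_t  sign_ext_imm;
        uint8_t  rt, rd;           /* candidate destination registers */
    } id_ex_reg;

    typedef struct {               /* EX/MEM: branch target, ALU result, store data */
        uint32_t branch_target, alu_result, store_data;
        uint8_t  zero_flag, write_reg;
    } ex_mem_reg;

    typedef struct {               /* MEM/WB: loaded data, ALU result, destination */
        uint32_t mem_data, alu_result;
        uint8_t  write_reg;
    } mem_wb_reg;

    int main(void) {
        if_id_reg f = { .instr = 0x8d280020u,    /* lw $t0, 32($t1) */
                        .pc_plus4 = 0x00400004u };
        printf("IF/ID holds instr=0x%08x pc+4=0x%08x\n", f.instr, f.pc_plus4);
        return 0;
    }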
Pipelined Datapath
Pipeline registers must be wide enough to hold the data coming into them (64, 128, 97, and 64 bits for IF/ID, ID/EX, EX/MEM, and MEM/WB respectively).
[Figure: the single-cycle datapath with the four pipeline registers inserted between the IF, ID, EX, MEM, and WB stages – PC and instruction memory, register file and sign extend, ALU and branch adder, data memory, and write-back mux as before.]
[Figure: multiple-clock-cycle pipeline diagram, clock cycles CC 1 – CC 8 along the time axis, with lw $t0, 10($t1) flowing through IM, REG, ALU, DM, REG.]
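To make the multiple-clock-cycle picture concrete, a toy C sketch (mine, not the course's) that prints which stage a few example instructions from earlier slides occupy in each cycle, assuming one instruction issued per cycle with no stalls:

    #include <stdio.h>

    int main(void) {
        const char *stage[5] = {"IM ", "REG", "ALU", "DM ", "REG"};
        const char *instr[3] = {"lw  $t0, 10($t1)",
                                "add $4, $5, $6",
                                "or  $7, $8, $9"};
        const int n_instr = 3, n_stages = 5;
        const int n_cycles = n_instr - 1 + n_stages;

        printf("%-20s", "");                       /* header: clock cycles */
        for (int c = 1; c <= n_cycles; c++) printf("CC%-3d", c);
        printf("\n");

        /* Instruction i occupies stage (c - i) during cycle c (0-based). */
        for (int i = 0; i < n_instr; i++) {
            printf("%-20s", instr[i]);
            for (int c = 0; c < n_cycles; c++) {
                int s = c - i;
                printf("%-5s", (s >= 0 && s < n_stages) ? stage[s] : "");
            }
            printf("\n");
        }
        return 0;
    }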