Pipelining
Speedup = Average instruction time without pipeline / Average instruction time with pipeline
        = 260 / 65 = 4
chow cs420/520-CH3-Pipelining-5/14/99--Page 1-
Pipeline Designer's Goal and Limits
l Goal → Balance the length of the pipeline stages.
[Figure: three pipelines compared — unbalanced stages (50, 50, 100, 50, 50 ns), a pipeline with balanced stages obtained by dividing the 100 ns stage 3 into two 50 ns stages, and a version with parallelized stages.]
l More stages → fewer operations per stage → smaller clock cycle time per stage
→ allows the pipeline to maintain a low CPI.
l But clock cycle time > (latch overhead + clock skew),
and there is a limited number of operations to perform per instruction.
—Latches are used between stages to hold an instruction's intermediate values.
—Clock signals reach different stages at different times ≡ clock skew.
l Pipelining increases the CPU instruction throughput (no. of instructions/sec)
but does not reduce the execution time for an individual instruction.
In fact, the execution time increases due to the pipeline overhead.
(260 vs. 325 nsec).
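The speedup and latency figures above (260 ns unpipelined, 65 ns cycle, 325 ns pipelined latency) can be checked with a short sketch; the constant names and the 5-stage assumption are mine:

```python
# Pipeline speedup and latency sketch (numbers taken from the slides).
UNPIPELINED_TIME_NS = 260      # average instruction time without pipeline
STAGE_TIME_NS = 65             # clock cycle of the pipelined machine
NUM_STAGES = 5                 # IF, ID, EX, MEM, WB

# Throughput view: one instruction completes per cycle once the pipeline is full.
speedup = UNPIPELINED_TIME_NS / STAGE_TIME_NS

# Latency view: a single instruction now takes all 5 stages to finish.
pipelined_latency = NUM_STAGES * STAGE_TIME_NS

print(speedup)            # 4.0
print(pipelined_latency)  # 325
```

Note how the individual instruction gets slower (325 vs. 260 ns) even as throughput quadruples.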
DLX without Pipelining
The five cycles of DLX instruction execution:
Instruction Fetch (IF) Cycle:
IR ← Mem[PC]
NPC ← PC+4
Instruction Decoding/Register Fetch (ID) Cycle:
A ← Regs[IR6..10]
B ← Regs[IR11..15]
Imm ← ((IR16)16##IR16..31)
A and B are two temporary registers.
Decoding is done in parallel with the reading of registers A and B.
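The Imm computation above sign-extends the 16-bit field: replicate bit 16 (the immediate's sign bit) sixteen times, then concatenate bits 16..31. A minimal sketch of that operation, with a function name of my choosing:

```python
def sign_extend16(imm16: int) -> int:
    """Sign-extend a 16-bit field to 32 bits, i.e. (IR16)^16 ## IR16..31."""
    imm16 &= 0xFFFF
    if imm16 & 0x8000:           # sign bit set: replicate it into the upper half
        return imm16 | 0xFFFF0000
    return imm16

print(hex(sign_extend16(0x0003)))   # 0x3
print(hex(sign_extend16(0xFFFF)))   # 0xffffffff  (-1 as unsigned 32-bit)
```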
DLX without Pipelining
EX and MEM Instruction Cycle
Execution/effective address (EX) cycle: (depends on instruction type)
l Memory reference: (Load or Store instructions)
ALUoutput ← A + Imm; (compute the address)
l Register-Register ALU instructions: (op indicated by instruction decoding)
ALUoutput ← A op B;
l Register-Immediate ALU instructions: (LI R3, #3)
ALUoutput ← A op Imm;
l Branch:
ALUoutput ← NPC + Imm; (branch target address, NPC=PC+4)
Cond ← (A op 0); (here op can be EQ or NE)
Memory access/branch completion (MEM) cycle:
l Memory reference:
LMD ← Mem[ALUoutput]; (load instruction) or
Mem[ALUoutput] ← B; (store instruction)
l Branch:
if (cond) PC ← ALUoutput else PC ← NPC (next instruction)
DLX without Pipelining
WB Instruction Cycle
Write-back (WB) cycle:
l Load instruction:
Regs[IR11..15] ← LMD;
l Register-Register ALU instructions:
Regs[IR16..20] ← ALUoutput;
l Register-Immediate ALU instructions: (LI R3, #3)
Regs[IR11..15] ← ALUoutput;
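The five cycles above can be summarized for a load word (LW rd, imm(rs1)) in a small sketch; the register-file and memory representations here are my assumptions, not part of DLX:

```python
# A minimal sketch of the five unpipelined DLX cycles for LW rd, imm(rs1).
def execute_lw(regs, mem, pc, rs1, rd, imm):
    # IF: fetch (the instruction arrives pre-decoded as fields here)
    npc = pc + 4
    # ID: read the source register into temporary A
    a = regs[rs1]
    # EX: effective address computation
    alu_output = a + imm
    # MEM: memory access
    lmd = mem[alu_output]
    # WB: write the loaded value back to the register file
    regs[rd] = lmd
    return npc

regs = [0] * 32
regs[1] = 100                  # base register R1
mem = {100: 42}                # one word of "memory" at address 100
pc = execute_lw(regs, mem, 0, rs1=1, rd=2, imm=0)
print(regs[2], pc)             # 42 4
```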
DLX Datapath
Basic DLX Pipeline
Clock cycle
instruction #     1    2    3    4    5    6    7    8    9
instruction i     IF   ID   EX   MEM  WB
instruction i+1        IF   ID   EX   MEM  WB
instruction i+2             IF   ID   EX   MEM  WB
instruction i+3                  IF   ID   EX   MEM  WB
instruction i+4                       IF   ID   EX   MEM  WB
IF: Instruction Fetch
ID: Instruction Decode
EX: Execution stage
MEM: Memory Stage
WB: Write Back (to register)
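The staggered diagram above can be generated with a short sketch (the formatting and function names are mine); with 5 stages and 1-based cycles, instruction i (counting from 0) completes in cycle i + 5:

```python
# Sketch: render the staggered pipeline diagram for n back-to-back instructions.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def diagram(n):
    """One row per instruction; each row is shifted one cycle right."""
    return [["   "] * i + [f"{s:<3}" for s in STAGES] for i in range(n)]

def completion_cycle(i):
    """1-based clock cycle in which instruction i (0-based) reaches WB."""
    return i + len(STAGES)

for row in diagram(3):
    print(" ".join(row))
print(completion_cycle(4))   # 9: instruction i+4 writes back in cycle 9
```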
Pipeline Stages and Their Resource Utilization
The write to the register file is performed in the first half of the WB stage so that
the result can be read by a downstream instruction in the second half of its ID stage.
Pipeline Registers of DLX
[Figure: the DLX pipeline registers (IF/ID, ID/EX, EX/MEM, MEM/WB). They carry values (NPC, IR, A, B, Imm, cond, ALUoutput, LMD) and control information from stage to stage.]
Instruction Execution on DLX Pipeline
ADDI R1, R0, #1000
bits:   0..5    6..10   11..15   16..31
field:  Opcode  rs1     rd       Imm
value:  ADDI    0       1        1000
IF: IF/ID.IR ← Mem[PC];
IF/ID.NPC, PC ← (if EX/MEM.cond {EX/MEM.ALUoutput} else {PC+4});
(at the initialization of pipeline, EX/MEM.cond is set to 0.)
ID: ID/EX.NPC ← IF/ID.NPC; ID/EX.IR ← IF/ID.IR;
ID/EX.A ← Regs[IF/ID.IR6..10=0]; ID/EX.B ← Regs[IF/ID.IR11..15=1] (not used)
ID/EX.Imm ← (IF/ID.IR16)16##IF/ID.IR16..31;
EX: EX/MEM.IR ← ID/EX.IR;
EX/MEM.ALUoutput ← ID/EX.A + ID/EX.Imm
EX/MEM.cond ← 0; (indicating not a branch)
MEM: MEM/WB.IR ← EX/MEM.IR;(why waste MEM step, not skip?)
MEM/WB.ALUoutput ← EX/MEM.ALUoutput;
WB: Regs[MEM/WB.IR11..15] ← MEM/WB.ALUoutput
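The ADDI walk-through above can be traced with plain dictionaries standing in for the four pipeline registers; the field names follow the slides, the Python representation is my sketch:

```python
# Trace of ADDI R1, R0, #1000 through the DLX pipeline registers.
regs = [0] * 32                                   # R0 is hard-wired to 0

if_id = {"IR": ("ADDI", 0, 1, 1000), "NPC": 4}    # after IF
id_ex = {"IR": if_id["IR"], "NPC": if_id["NPC"],
         "A": regs[if_id["IR"][1]],               # Regs[rs1=0]
         "Imm": if_id["IR"][3]}                   # after ID
ex_mem = {"IR": id_ex["IR"],
          "ALUoutput": id_ex["A"] + id_ex["Imm"], # 0 + 1000
          "cond": 0}                              # after EX (not a branch)
mem_wb = {"IR": ex_mem["IR"],
          "ALUoutput": ex_mem["ALUoutput"]}       # after MEM (pass-through)
regs[mem_wb["IR"][2]] = mem_wb["ALUoutput"]       # WB: R1 ← 1000

print(regs[1])   # 1000
```

The MEM stage does nothing useful here but still passes the values along, which keeps every instruction's stage timing uniform.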
Instruction Execution on DLX Pipeline
LW R2, 0(R1)
bits:   0..5    6..10   11..15   16..31
field:  Opcode  rs1     rd       Imm
value:  LW      1       2        0
IF: IF/ID.IR ← Mem[PC];
IF/ID.NPC, PC ← (if EX/MEM.cond {EX/MEM.ALUoutput} else {PC+4});
(At the initialization of pipeline, EX/MEM.cond is set to 0.)
ID: ID/EX.NPC ← IF/ID.NPC; ID/EX.IR ← IF/ID.IR;
ID/EX.A ← Regs[IF/ID.IR6..10=1]; ID/EX.B ← Regs[IF/ID.IR11..15]; (not used)
ID/EX.Imm ← (IF/ID.IR16)16##IF/ID.IR16..31;
EX: EX/MEM.IR ← ID/EX.IR;
EX/MEM.ALUoutput ← ID/EX.A + ID/EX.Imm; (compute the address)
EX/MEM.cond ← 0; (indicating not a branch)
MEM: MEM/WB.IR ← EX/MEM.IR;
MEM/WB.LMD ← Mem[EX/MEM.ALUoutput];
WB: Regs[MEM/WB.IR11..15=2] ← MEM/WB.LMD
Instruction Execution on DLX Pipeline
BEQZ R2, L
bits:   0..5    6..10   11..15      16..31
field:  Opcode  rs1     rd          Imm
value:  BEQZ    2       (not used)  4
IF: IF/ID.IR ← Mem[PC];
IF/ID.NPC, PC ← (if EX/MEM.cond {EX/MEM.ALUoutput} else {PC+4});
(at the initialization of pipeline, EX/MEM.cond is set to 0.)
ID: ID/EX.NPC ← IF/ID.NPC; ID/EX.IR ← IF/ID.IR;
ID/EX.A ← Regs[IF/ID.IR6..10=2]; ID/EX.B ← Regs[IF/ID.IR11..15]; (not used)
ID/EX.Imm ← (IF/ID.IR16)16##IF/ID.IR16..31;
EX: EX/MEM.ALUoutput ← ID/EX.NPC + ID/EX.Imm; (branch target address)
EX/MEM.cond ← (ID/EX.A == 0)
Note that EX/MEM.cond affects the PC value, and therefore the fetch of the next
instruction (the 3rd instruction after BEQZ). The 1st and 2nd instructions after
BEQZ, in the IF and ID stages, will have to be aborted if the branch is taken.
(Annotation from the timing figure: assume instr1 is not a load or store; otherwise all instructions after SUB are stalled one cycle too.)
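A toy cost model of the taken-branch penalty described above, assuming the two squashed IF/ID slots cost 2 extra cycles and a 4-cycle pipeline fill (the model and names are mine):

```python
# Sketch: total cycles for a 5-stage pipeline where a taken branch squashes
# the two younger instructions already in IF and ID.
def total_cycles(n_instructions, taken_branches, penalty=2):
    fill = 4                                   # cycles to fill a 5-stage pipeline
    return fill + n_instructions + taken_branches * penalty

print(total_cycles(100, 0))    # 104: no taken branches
print(total_cycles(100, 10))   # 124: 10 taken branches cost 20 extra cycles
```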
Predict-taken
l Applies to situations where the target address is known before the branch
outcome.
l For the DLX pipeline, the target address and the branch outcome are known at
the same stage, so there is no advantage to using this scheme.
l There is always a one-clock-cycle stall if the predict-taken scheme is used in
the DLX pipeline.
Branch delay slots: the sequential instructions between the branch instruction and
the branch target instruction are said to be in the branch delay slots. They will be
executed no matter what; therefore they had better not affect the computation if
the branch is taken.
Scheduling strategy   Requirements                        Improves performance when?
From before branch    Branch must not depend on the       Always
                      results of rescheduled
                      instructions
From target           Must be OK to execute the           When branch is taken. May
                      rescheduled instruction if the      enlarge the program if
                      branch is not taken. May need       instructions are duplicated
                      to duplicate the instruction
From fall through     Must be OK to execute the           When branch is not taken
                      instructions if the branch is
                      taken
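The delay-slot semantics above ("executed no matter what") can be illustrated with a tiny interpreter for a made-up two-instruction ISA; this is entirely my sketch, not DLX encoding:

```python
def run(program, regs):
    """Execute a toy instruction list with one-slot delayed branches."""
    pc = 0
    while pc < len(program):
        op, *args = program[pc]
        if op == "addi":                        # addi rd, rs, imm
            rd, rs, imm = args
            regs[rd] = regs[rs] + imm
            pc += 1
        elif op == "beqz":                      # beqz rs, target (delayed)
            rs, target = args
            taken = regs[rs] == 0
            # the delay-slot instruction executes regardless of the outcome
            op2, rd2, rs2, imm2 = program[pc + 1]
            assert op2 == "addi"                # sketch handles only addi slots
            regs[rd2] = regs[rs2] + imm2
            pc = target if taken else pc + 2
        else:
            pc += 1
    return regs

regs = run([("beqz", 0, 3),        # R0 == 0, so the branch is taken to index 3
            ("addi", 1, 1, 5),     # delay slot: executes anyway
            ("addi", 2, 2, 9),     # skipped by the taken branch
            ("addi", 3, 3, 7)],    # branch target
           [0] * 8)
print(regs[1], regs[2], regs[3])   # 5 0 7
```

The delay-slot add to R1 happened even though the branch was taken, which is exactly why the compiler must pick an instruction whose effect is harmless (or wanted) on both paths.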
Structural Hazard
instruction         1  2  3  4  5  6  7  8  9  10  11
MULTD F0, F4, F6    IF ID M1 M2 M3 M4 M5 M6 M7 MEM WB
…                      IF ID EX MEM WB
…                         IF ID EX MEM WB
ADDD F2, F4, F6              IF ID A1 A2 A3 A4 MEM WB
…                               IF ID EX MEM WB
…                                  IF ID EX MEM WB
LD F8, 0(R2)                          IF ID EX MEM WB
l At clock cycle 11, all three instructions try to write to the register file. If there
is only one write port, this is a structural hazard.
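The write-port conflict above can be detected by counting WB-stage occupancy per cycle; the wb_cycle formula (issue + 1 + EX stages + 2) and all names here are my sketch:

```python
from collections import Counter

# Sketch: with IF, ID, a variable number of execute stages, MEM, WB,
# an instruction issued in cycle c reaches WB in cycle c + 1 + ex_stages + 2.
def wb_cycle(issue_cycle, ex_stages):
    return issue_cycle + 1 + ex_stages + 2

writes = Counter()
for issue, ex in [(1, 7),   # MULTD: M1..M7
                  (4, 4),   # ADDD:  A1..A4
                  (7, 1)]:  # LD:    single EX stage
    writes[wb_cycle(issue, ex)] += 1

print(dict(writes))   # {11: 3}: all three reach WB in cycle 11
```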
Number of stall cycles for each type of FP operation: it is about 46% to 59% of the
latency of the functional units. The data was generated by running 5 FP SPEC benchmarks.
The compiler tries to schedule both load and FP delays before it schedules
branch delays. Why?
a) Identify the data hazards that can be solved by the forwarding techniques.
b) Are there stalls due to data hazards even though the forwarding technique is used?
How would you improve that?
c) Are there control hazards? How would you resolve that?
(S denotes a stall cycle; each instruction issues one cycle after the previous one.)
LW    F2, 1500(R0)    F D X M B
SLLI  R5, R1, #2      F D X M B
LW    F7, 6000(R5)    F D X M B
LW    F6, 5000(R5)    F D X M B
MULTF F7, F2, F7      F D X1 X2 X3 X4 X5 X6 X7 M B
ADDI  R1, R1, #1      F D X M B
SLE   R8, R1, R3      F D X M B
SUBF  F7, F6, F7      F D S S S S X1 X2 X3 X4 M B
BNEZ  R8, L1          F S S S S D X M B
SW    6000(R5), F7    F D S S X M
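The four stall cycles shown for SUBF can be computed from the producer and consumer execute cycles; this forwarding model and its names are my sketch:

```python
# Sketch: with forwarding, a consumer's first execute cycle must come
# right after the producer's last execute cycle.
def stalls(producer_issue, producer_ex, consumer_issue):
    producer_last_ex = producer_issue + 1 + producer_ex   # F, D, then EX stages
    consumer_first_ex = consumer_issue + 2                # F, D, X with no stalls
    return max(0, producer_last_ex + 1 - consumer_first_ex)

# MULTF issues in cycle 5 with 7 execute stages; SUBF issues in cycle 8.
print(stalls(5, 7, 8))   # 4 stall cycles, matching the four S's above
```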
e) How many times do we have to unroll the loop to avoid the stalls completely?
Ans: