Slide 4

Computer Architecture
Design, Analysis, Execution and Optimization of Instructions

Datapath & CU for Pipelined Microprocessor: MIPS
Objectives
• Design the processor such that
  • the clock period (T) is shorter than that of a single-cycle processor [similar to a multi-cycle one]
  • IPC (1/CPI) is 1
Problems of Multi-cycle Processor
• The fundamental problem
  • The slowest instruction, lw, is split into 5 steps
  • The processor's clock cycle time does not improve 5 times (185 ps)
  • The steps take unequal lengths of time
  • Only one stage is busy while the remaining stages are idle
  • 5 non-architectural registers and an additional multiplexer are needed

Comparison of Single & Multi-Cycle MIPS Processor

Instruction      Multi-cycle (clock cycles)   Single-cycle (clock cycles)
LW               5                            1
SW               4                            1
R-type           4                            1
BEQ              3                            1
ADDI             4                            1
J                3                            1
CPI              > 1                          1
CLK period (T)   Less than single-cycle       More than multi-cycle

• Single-cycle: non-shared FUs, CPI = 1 or IPC = 1, clock period (Tsingle) = delay of the slowest instruction in the ISA
• Multi-cycle: shared FUs, CPI > 1 or IPC < 1, clock period Tmulti < Tsingle
• Cycles Per Instruction (CPI); Instructions Per Cycle (IPC) = 1/CPI
• Program execution time: #instr. x CPI x Clk (T)
• Can we have a microprocessor with IPC = 1 and a clock period similar to the multi-cycle one (Tmulti < Tsingle)?
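The CPI, IPC and execution-time relations can be checked with a few lines of Python. This is a minimal sketch: the cycle counts come from the comparison table, while the instruction mix `ASSUMED_MIX` and the helper names are hypothetical and only serve the illustration.

```python
# Cycle counts per instruction class, taken from the comparison table.
MULTI_CYCLE_CYCLES = {"lw": 5, "sw": 4, "r-type": 4, "beq": 3, "addi": 4, "j": 3}

# Hypothetical instruction mix (fractions of executed instructions); not from the slides.
ASSUMED_MIX = {"lw": 0.25, "sw": 0.10, "r-type": 0.52, "beq": 0.11, "j": 0.02}


def average_cpi(cycles, mix):
    """Weighted average of cycles per instruction for a given mix."""
    return sum(cycles[name] * share for name, share in mix.items())


def execution_time_ps(n_instr, cpi, clock_period_ps):
    """Program execution time = #instr x CPI x T."""
    return n_instr * cpi * clock_period_ps


cpi_multi = average_cpi(MULTI_CYCLE_CYCLES, ASSUMED_MIX)
ipc_multi = 1 / cpi_multi
print(f"multi-cycle CPI = {cpi_multi:.2f}, IPC = {ipc_multi:.3f}")
```

With this assumed mix the multi-cycle CPI is about 4.12, so the IPC is well below 1; a single-cycle or ideal pipelined design would pass `cpi=1` to `execution_time_ps`.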
Problems of Multi-cycle Processor
• Only one stage is busy and the remaining stages are idle at any time
Book- P&H-COD
Pipeline in a Chemical Plant
[Figure: chemical-plant pipeline; labels: Water, Additive, Steam, Filter, Mixer, Boiler]
Pipeline in the Instruction Execution

Memory words -> Instruction Fetch [Stage-1] -> Instruction Decode [Stage-2] -> Instruction Execution [Stage-3] -> Results

Time     t1  t2  t3  t4  t5
Stage-1  1   2   3   4   5
Stage-2      1   2   3   4
Stage-3          1   2   3
What is the difference in this Analogical or Parallel reasoning?
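The stage/time chart can be generated programmatically. The sketch below is only an illustration under the ideal assumption that every instruction advances one stage per cycle without stalls; the function name `space_time_diagram` is hypothetical.

```python
def space_time_diagram(n_instr, n_stages):
    """Return rows[s][t] = instruction number in stage s at cycle t (0 if idle)."""
    n_cycles = n_instr + n_stages - 1  # cycles until the last instruction leaves
    rows = []
    for s in range(n_stages):
        row = []
        for t in range(n_cycles):
            instr = t - s + 1  # instruction i enters stage s at cycle i + s - 1
            row.append(instr if 1 <= instr <= n_instr else 0)
        rows.append(row)
    return rows


diagram = space_time_diagram(5, 3)
for s, row in enumerate(diagram, start=1):
    cells = " ".join(str(i) if i else "." for i in row)
    print(f"Stage-{s}  {cells}")
```

Five instructions on three stages finish in 5 + 3 - 1 = 7 cycles, and the first five columns reproduce the chart above.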

Pipelined MIPS-based processor
• Partition the instruction execution cycle (function) into subfunctions
  • The input of one subfunction comes TOTALLY from the output of the previous subfunctions
  • Other than inputs & outputs, there are no interrelationships between subfunctions
  • Hardware (a stage) may be developed to execute each subfunction
  • The evaluation times of the hardware units are usually approximately equal
• Powerful way to improve the throughput
• Divide the single-cycle implementation into 5 stages
  • Fetch
  • Decode
  • Execute
  • Memory
  • Writeback
• A commercial MIPS processor: R2000/R3000
• The latency of each instruction is unchanged, but the throughput is ideally 5 times better

• Stage elements
  • Reading & writing the memory
  • Register file
  • ALU operation
• Each stage takes almost the same amount of time
  • Each stage consists of one element
Comparison of timing diagram

• Delay of the elements

Element               Parameter   Delay (ps)
Register clk-to-Q     Tpcq        30
Register setup        Tsetup      20
Multiplexer           Tmux        25
ALU                   TALU        200
Memory read           Tmem        250
Register file read    tRFread     150
Register file write   tRFwrite    100
Register file setup   tRFsetup    20

Comparison of timing diagram
• Delay of MUX & register is not included
[Figure: timing diagram of (a) single-cycle processor, (b) pipelined processor]
Book- P&H-COD
Comparison of timings
• Single-cycle processor
  • Instruction latency is 950 ps (250 + 150 + 200 + 250 + 100)
  • Throughput is 1 instruction per 950 ps
  • 1.05 billion instructions per second
• Pipelined processor
  • Length of a pipeline stage is 250 ps (memory access)
  • Instruction latency is 5 * 250 = 1250 ps
  • Throughput is 1 instruction per 250 ps
  • 4 billion instructions per second
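These figures follow directly from the element delays. The sketch below recomputes them, assuming (as the slides do) that MUX and register delays are ignored and that each stage is dominated by a single element; the variable names are illustrative only.

```python
# Element delays in ps, from the delay table (MUX and register delays ignored).
STAGE_DELAY_PS = {
    "fetch (memory read)": 250,
    "decode (register file read)": 150,
    "execute (ALU)": 200,
    "memory (memory read)": 250,
    "writeback (register file write)": 100,
}

PS_PER_SECOND = 1e12

single_latency = sum(STAGE_DELAY_PS.values())        # one long cycle covers all elements
stage_length = max(STAGE_DELAY_PS.values())          # the slowest stage sets the clock
pipe_latency = len(STAGE_DELAY_PS) * stage_length    # every stage takes one full clock

single_throughput = PS_PER_SECOND / single_latency   # instructions per second
pipe_throughput = PS_PER_SECOND / stage_length

print(f"single-cycle: {single_latency} ps, {single_throughput / 1e9:.2f} G instr/s")
print(f"pipelined:    {pipe_latency} ps, {pipe_throughput / 1e9:.2f} G instr/s")
```

The throughput ratio is 950 / 250 = 3.8 rather than the ideal 5, because the stages are not equally long.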
A view of pipeline in operation
• Resource utilization
Book- P&H-COD
Delay elements and stage registers
[Figure: PC -> IM (250 ps) -> IF_ID -> RF (150 ps) -> ID_EXE -> ALU (200 ps) -> EXE_MEM -> DM (250 ps) -> MEM_WB]
Delay values are from the previous table.

Datapath for R-type: ADD R1, R2, R3

Field   op       rs       rt       rd       shamt    funct
Width   6 bits   5 bits   5 bits   5 bits   5 bits   6 bits
Bits    31-26    25-21    20-16    15-11    10-6     5-0

[Figure: PC -> IM -> IF_ID -> RF -> ID_EXE -> ALU -> EXE_MEM]
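The R-type field layout can be illustrated by slicing a 32-bit instruction word. This is a small sketch, not part of the original slides: the function name is hypothetical, and the example word assumes the standard MIPS encoding of ADD with rd = R1, rs = R2, rt = R3 and funct = 0x20.

```python
def decode_r_type(word):
    """Split a 32-bit instruction word into the six R-type fields."""
    return {
        "op":    (word >> 26) & 0x3F,  # bits 31-26
        "rs":    (word >> 21) & 0x1F,  # bits 25-21
        "rt":    (word >> 16) & 0x1F,  # bits 20-16
        "rd":    (word >> 11) & 0x1F,  # bits 15-11
        "shamt": (word >> 6) & 0x1F,   # bits 10-6
        "funct": word & 0x3F,          # bits 5-0
    }


# Assumed encoding of ADD R1, R2, R3 (rd=1, rs=2, rt=3, funct=0x20).
ADD_R1_R2_R3 = 0x00430820
fields = decode_r_type(ADD_R1_R2_R3)
print(fields)
```

In the pipelined datapath, rs and rt select the register-file read ports in the decode stage, while rd has to travel along the stage registers until write-back.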
Datapath for R-type: ADD R1, R2, R3
[Figure: datapath with PC, IM, RF and ALU]
• Register file's write operation @posedge & stage registers' write operation @negedge
Datapath for B-type: BEQ R1, R3, offset

Field   op       rs       rt       offset
Width   6 bits   5 bits   5 bits   16 bits
Bits    31-26    25-21    20-16    15-0

• Control hazards: when BEQ is in the 3rd stage, which instructions will be in stages 1 and 2?
[Figure: datapath with PC, IM, RF and ALU]
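Datapath for J-type: J Offset
Fields: op, address
[Figure: datapath with PC, IM and IF_ID]
Datapath for I-type: LW R1, #5(R3)
Fields: op, rs, rt, offset
[Figure: datapath with PC, IM, RF, ALU and DM]
Datapath for I-type: SW R1, #5(R3)
Fields: op, rs, rt, offset
[Figure: datapath with PC, IM, RF, ALU and DM]
Datapath for LW R1, #5(R3) & R-type
[Figure: datapath with PC, IM, RF, ALU and DM]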
Combined Datapath
• Stages
• Insert multiplexers
• Stage registers
  • Union of the registers added for each instruction
[Figure: combined datapath, including Jump]
Book- P&H-COD
Control unit for Pipelined MIPS processor

• Identify the control signals
  • Jump
  • RegDst
  • RegWrite
  • ALUSrc
  • Branch
  • ALUOp
  • MemRead
  • MemWrite
  • MemtoReg

Fetch   Decode   Execute   Memory     Write Back
Jump             ALUSrc    Branch     MemtoReg
                 ALUOp     MemRead    RegWrite
                 RegDst    MemWrite

• How does one generate the control signals?
  • Analogy: we want to add a mineral liquid to the water flowing through a pipeline after every 1 km
  • How?
Control generation: Single-cycle

Instr.   Jump  RegDst  RegWrite  ALUSrc  Branch  ALUOp1  ALUOp0  MemRead  MemWrite  MemtoReg
R-type   0     1       1         0       0       1       0       0        0         0
lw       0     0       1         1       0       0       0       1        0         1
sw       0     x       0         1       0       0       0       0        1         x
addi     0     0       1         1       0       0       0       0        0         0
B-type   0     x       0         0       1       0       1       0        0         x
J-type   1     x       0         x       x       x       x       0        0         x
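In the single-cycle case the control unit is a pure lookup from instruction class to the ten control lines. The sketch below encodes the table as strings; the names `SIGNALS`, `CONTROL_TABLE` and `control_signals` are illustrative, and "x" is kept as a don't-care marker rather than resolved to 0 or 1.

```python
SIGNALS = ("Jump", "RegDst", "RegWrite", "ALUSrc", "Branch",
           "ALUOp1", "ALUOp0", "MemRead", "MemWrite", "MemtoReg")

# One row per instruction class, copied from the table; "x" means don't care.
CONTROL_TABLE = {
    "R-type": "0110010000",
    "lw":     "0011000101",
    "sw":     "0x0100001x",
    "addi":   "0011000000",
    "B-type": "0x0010100x",
    "J-type": "1x0xxxx00x",
}


def control_signals(instr_class):
    """Return a dict mapping each control signal name to '0', '1' or 'x'."""
    return dict(zip(SIGNALS, CONTROL_TABLE[instr_class]))


print(control_signals("lw"))
```

A hardware implementation would be combinational logic on the opcode; the dictionary only mirrors its truth table.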
Control generation: Multi-cycle

Fetch   Decode   Execute   Memory     Write Back
Jump             ALUSrc    Branch     MemtoReg
                 ALUOp     MemRead    RegWrite
                 RegDst    MemWrite

[Figure: control state diagram. The starting state IF (T0) leads to ID (T1); from there each instruction (J, ADD, LW, SW, ADDI, BNE) follows its own EXE, MEM and WB states, labelled T2 to T10]
Control generation: Strategy - 1
[Figure: datapath with PC and IM; the label Instr is repeated at four points along the pipeline]
Control generation: Strategy - 2
Book- P&H-COD

Control lines grouped by pipeline stage:
• Execution/Address Calc stage control lines: RegDst, ALUOp1, ALUOp0, ALUSrc
• Memory access stage control lines: Branch, MemRead, MemWrite
• Write-back stage control lines: RegWrite, MemtoReg

Instr     Jump  RegDst  ALUOp1  ALUOp0  ALUSrc  Branch  MemRead  MemWrite  RegWrite  MemtoReg
R-format  0     1       1       0       0       0       0        0         1         0
lw        0     0       0       0       1       0       1        0         1         1
sw        0     x       0       0       1       0       0        1         0         x
beq       0     x       0       1       0       1       0        0         0         x

Single cycle MIPS processor

Instr     Jump  RegDst  ALUOp1  ALUOp0  ALUSrc  Branch  MemRead  MemWrite  RegWrite  MemtoReg
R-format  0     1       1       0       0       0       0        0         1         0
lw        0     0       0       0       1       0       1        0         1         1
sw        0     x       0       0       1       0       0        1         0         x
beq       0     x       0       1       0       1       0        0         0         x
• How to generate such control signals?
  • Set the 10 control lines in each stage for each instruction
  • The simplest way is the same as in the single-cycle processor
  • Most of the controls can be generated at the same time, in the decode stage
• How to manage the control signals generated for the i-th instruction while the control signals for the (i+1)-th instruction are being generated?
  • Erroneous control signals can be generated
  • Extend the pipeline registers to store the control signals' values
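Pipelined Datapath & Control
Pipelined Control Signals

[Figure: the Control Unit decodes the opcode and its outputs travel through control pipeline registers alongside the data pipeline (PC, RF, ALU, ALUDec, DM)]

Decode (D)     Execute (E)    Memory (M)    Write Back (W)
RegWriteD      RegWriteE      RegWriteM     RegWriteW
MemtoRegD      MemtoRegE      MemtoRegM     MemtoRegW
MemWriteD      MemWriteE      MemWriteM
MemReadD       MemReadE       MemReadM
BranchD        BranchE        BranchM
ALUOpD [1:0]   ALUOpE [1:0]
ALUSrcD        ALUSrcE
RegDstD        RegDstE
JumpD

Control pipeline register widths: 9 bits (Decode to Execute), 5 bits (Execute to Memory), 2 bits (Memory to Write Back)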
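The idea of extending the pipeline registers with control bits can be sketched in software. The model below is a simplification under stated assumptions: control bundles are plain dictionaries, an empty dictionary stands for "no instruction", and the helper names `drop` and `advance` are hypothetical. The lw values come from the pipelined control table.

```python
# Control values for lw, taken from the pipelined control table.
LW_CONTROL = {
    "RegDst": 0, "ALUOp": 0b00, "ALUSrc": 1,      # consumed in Execute
    "Branch": 0, "MemRead": 1, "MemWrite": 0,     # consumed in Memory
    "RegWrite": 1, "MemtoReg": 1,                 # consumed in Write Back
}

EX_SIGNALS = ("RegDst", "ALUOp", "ALUSrc")
MEM_SIGNALS = ("Branch", "MemRead", "MemWrite")


def drop(signals, used):
    """Copy a control bundle without the signals consumed by the current stage."""
    return {name: value for name, value in signals.items() if name not in used}


def advance(regs, decoded):
    """One clock edge: shift control bundles ID_EXE -> EXE_MEM -> MEM_WB."""
    return {
        "ID_EXE": decoded,
        "EXE_MEM": drop(regs["ID_EXE"], EX_SIGNALS),
        "MEM_WB": drop(regs["EXE_MEM"], MEM_SIGNALS),
    }


regs = {"ID_EXE": {}, "EXE_MEM": {}, "MEM_WB": {}}
regs = advance(regs, LW_CONTROL)  # lw is decoded
regs = advance(regs, {})          # nothing decoded in the next two cycles
regs = advance(regs, {})
print(regs["MEM_WB"])
```

After three clock edges only RegWrite and MemtoReg of the lw remain, in the last register, which mirrors the shrinking 9-bit, 5-bit and 2-bit control registers in the figure. Each instruction carries its own control values, so the signals decoded for instruction i+1 cannot overwrite those of instruction i.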
Designing Instruction Sets for Pipelining

• MIPS instructions are all the same length
  • x86 instructions vary from 1 byte to 15 bytes; does that make pipelining challenging?
• MIPS has only a few addressing modes
• Memory operands appear only in loads or stores in MIPS
• Operands are aligned in memory
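Comparison of datapaths
(CL: Combinational Logic)
[Figure: three datapaths compared]
• One CL block per clock (CLK): only one instruction is in the datapath at an instant of time
• Several CL blocks in sequence: still only one instruction at a time. What if one instruction is here (in an otherwise idle CL block)?
• Five CL blocks separated by clocked registers: instr. #5, instr. #4, instr. #3, instr. #2 and instr. #1 occupy the blocks at the same time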

Microprocessor Design Trade-offs: Interconnects Vs Functional Units Vs IPC
• (Clock) Cycles Per Instruction (CPI)
• Instructions Per (clock) Cycle (IPC) = 1/CPI

                      Functional Units (FUs)
Interconnects (Bus)   Less                       More
Less                  Single-bus & Single-FU     Single-bus & Many-FUs
                      (Multi-Cycle, IPC < 1)     (Multi-Cycle, IPC < 1)
More                  Many-bus & Single-FU       Many-bus & Many-FUs
                      (Multi-Cycle, IPC < 1)     (Single-Cycle or Pipeline, IPC = 1)

Methods/Algorithms:
1) Multi-Cycle
2) Single-Cycle
3) Pipelined

• Pipeline: IPC = 1 (borrowed from Single-Cycle) and a shorter clock period (T) (borrowed from Multi-Cycle); the buses & FUs are shared by more than one instruction.
• Program execution time: #instr. x (1/IPC) x Clk (T)
• Can we have IPC > 1?

Can we have IPC > 1?
• Multiple issue processor
• Superscalar and VLIW
IPC = 2, Superscalar or multiple issue processor
IPC = 3, Superscalar or 3 issue processor
• 3 instructions can be fetched or issued
• 3 instructions can be executed in parallel (in-order superscalar execution)
[Figure: Instruction-1, Instruction-2 and Instruction-3 side by side; axes: depth and spatial parallelism]
Static Multiple Issue MIPS Processor
• Two-issue instruction
  • Integer ALU operations: ADD, BNE, etc.
  • Data transfer operations: LW & SW
• What did we do when we merged two instructions?
Book-COD-P&H, CH-4
Static Multiple Issue Processor
Book-COD-P&H, CH-4
Very Long Instruction Word (VLIW)
If one instruction of the pair cannot be used, we require that it be replaced with a nop. Thus, the instructions always issue in pairs, possibly with a nop in one slot.
In some designs, the compiler takes full responsibility for removing all hazards, scheduling the code and inserting no-ops so that the code executes without any need for hazard detection or hardware-generated stalls.
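The pairing rule can be illustrated with a toy packer that forms two-slot issue packets (ALU/branch slot first, load/store slot second) and fills an unused slot with a nop. This is a deliberately simplified sketch: it ignores all data and control hazards, which a real compiler would have to check, and the mnemonic sets and function name are assumptions made for the example.

```python
ALU_OR_BRANCH = {"add", "sub", "and", "or", "slt", "addi", "beq", "bne"}
LOAD_STORE = {"lw", "sw"}


def pack_two_issue(program):
    """Greedy in-order packing into (ALU/branch slot, load/store slot) pairs.

    Hazards are ignored here; unused slots are filled with "nop".
    """
    packets = []
    i = 0
    while i < len(program):
        first = program[i]
        second = program[i + 1] if i + 1 < len(program) else None
        if first in ALU_OR_BRANCH and second in LOAD_STORE:
            packets.append((first, second))
            i += 2
        elif first in LOAD_STORE and second in ALU_OR_BRANCH:
            packets.append((second, first))
            i += 2
        elif first in ALU_OR_BRANCH:
            packets.append((first, "nop"))
            i += 1
        elif first in LOAD_STORE:
            packets.append(("nop", first))
            i += 1
        else:
            raise ValueError(f"unknown mnemonic: {first}")
    return packets


packets = pack_two_issue(["lw", "add", "add", "sw", "beq"])
print(packets)
```

Two adjacent instructions of the same kind cannot share a packet, so a sequence such as add, add, lw needs a nop in the first packet's load/store slot.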
Book-COD-P&H, CH-4
Figure- A static two-issue datapath.
The additions needed for double issue are highlighted: another 32 bits from instruction memory, two more read ports and one more write port on the register file, and another ALU. Assume the bottom ADDER handles address calculations for data transfers and the top ALU handles everything else.
Book-COD-P&H, CH-4
Difference between Superscalar and VLIW
• General & special instructions
• Compiler vs. dynamic scheduling
• Hazard detection
• Instruction format
• A different VLIW processor requires recompilation of the application
Book-COD-P&H, CH-4
ISA design steps
• Step-1:
  • Find out the instructions for the algorithm(s)
• Step-2: [Microarchitecture design]
  • Decide the strategy (Shared bus/Single-cycle/Multi-cycle/Pipeline [in-order]/etc.) for the datapath
  • Then design the datapath and its components for each instruction
• Step-3:
  • Design the combined datapath for all instructions
• Step-4:
  • Decide the clock period based on the critical path [timing analysis]
  • Add setup time, clock-to-Q delay, etc. to the decided clock period [and satisfy hold time constraints]
• Step-5:
  • Identify the control signals on the combined datapath
• Step-6:
  • Design the Control Unit (H/W or S/W) for generating such control signals, based on the strategy (Shared bus/Single-cycle/Multi-cycle/Pipeline [in-order]/etc.) decided for the datapath
• Step-7:
  • Test & verification of the designed processor
• How about a single-purpose microprocessor like the MinMax microprocessor?
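Step-4 can be illustrated with the element delays given earlier in the slides. The sketch below assumes each pipeline stage is a register, one slow element, and another register, and it ignores multiplexer delays and the hold-time check; the stage names and variable names are illustrative.

```python
# Delays in ps, from the delay table earlier in the slides.
T_PCQ = 30     # register clock-to-Q
T_SETUP = 20   # register setup

# Combinational delay of the one slow element assumed per stage.
STAGE_LOGIC_PS = {"IF": 250, "ID": 150, "EXE": 200, "MEM": 250, "WB": 100}

# The critical path is the slowest stage (the first one wins on a tie).
critical_stage = max(STAGE_LOGIC_PS, key=STAGE_LOGIC_PS.get)
critical_path_ps = STAGE_LOGIC_PS[critical_stage]

# Clock period = clock-to-Q + critical-path logic + setup.
clock_period_ps = T_PCQ + critical_path_ps + T_SETUP
print(f"critical stage {critical_stage}: T >= {clock_period_ps} ps")
```

Under these assumptions the register overhead stretches the 250 ps stage to a 300 ps clock period, which is why the earlier 250 ps figure is stated with register and MUX delays excluded.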
Applying Pipeline Technique in Other Processors
• MinMax Processor
  • Recording is available in the Google Classroom
• Simple CPU
  • Recording is available in the Google Classroom
Homework
• Design the pipelined MIPS ISA using Verilog HDL and C++
• Convert the MinMax microprocessor into a pipelined MinMax microprocessor
  • Design the pipelined MinMax microprocessor using Verilog HDL and C++
• How does Intel manage to run CISC-type code on a RISC-based pipeline?
Summary
• Limitation of Multi-cycle approach
• CPI Vs IPC
• Comparison between single-cycle and pipelined approaches
• Views of pipeline in operation
• Comparison of datapaths
• Design tradeoffs of microprocessors
• Datapath and CU for pipelined processor
