CO_unit_3 ppt
• If a memory operation is
involved, it takes place in stage
4.
• The register file also has a data input, C, and a corresponding address input to select the register into which data are to be written.
• The inputs and outputs of any memory unit are often called input and output ports. A memory unit that has two output ports is said to be dual-ported.
• The output of the ALU is connected to the data input, C, of the register file so that the results of a computation can be loaded into the destination register.
ALU stage with Inter-stage registers
• ALU performs
calculation specified
by the instruction.
• Multiplexer MuxB
selects either RB or
the Immediate field
of IR.
• Results stored in RZ.
• Data to be written in
the memory are
transferred from RB
to RM.
Memory stage (stage 4)
• For a memory instruction, RZ provides the memory address, and MuxY selects the read data to be placed in RY [ex: Load R5, X(R7)], which then goes into the register file (stage 5).
• RM provides data for a
memory write operation.
[ex: Store R6, X(R8)]
• For a calculation instruction [ex: Add R3, R4, R5], MuxY selects [RZ] to be placed in RY, which then goes into the register file (stage 5).
• Input 2 of MuxY is used in
subroutine calls.
The datapath in a Processor – Stages 2 to 5
Instruction Fetch and Execution Steps
Example: Add R3, R4, R5
1. Memory address ← [PC], Read memory, IR ← Memory data, PC ← [PC] + 4
2. Decode instruction, RA ← [R4], RB ← [R5]
3. RZ ← [RA] + [RB]
4. RY ← [RZ]
5. R3 ← [RY]
Instruction Fetch and Execution Steps
Example: Load R5, X(R7)
1. Memory address ← [PC], Read memory, IR ← Memory data, PC ← [PC] + 4
2. Decode instruction, RA ← [R7]
3. RZ ← [RA] + Immediate value X
4. Memory address ← [RZ], Read memory, RY ← Memory data
5. R5 ← [RY]
Instruction Fetch and Execution Steps
Example: Store R6, X(R8)
1. Memory address ← [PC], Read memory, IR ← Memory data, PC ← [PC] + 4
2. Decode instruction, RA ← [R8], RB ← [R6]
3. RZ ← [RA] + Immediate value X, RM ← [RB]
4. Memory address ← [RZ], Memory data ← [RM], Write memory
5. No action
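The five steps for Add, Load, and Store can be traced with a small sketch. This is only an illustration under simplifying assumptions: the tuple instruction format, the `execute` function, and the register and memory values are hypothetical, and the whole instruction is processed in one call rather than one step per clock cycle.

```python
# Minimal sketch of the five execution steps; all names are illustrative.

def execute(instr, regs, memory, pc):
    """Run one instruction through the five steps and return the new PC.

    instr is a tuple: ("Add", dst, src1, src2), ("Load", dst, offset, base),
    or ("Store", src, offset, base). Register names index the regs dict.
    """
    op = instr[0]

    # Step 1: fetch. IR <- memory data, PC <- [PC] + 4.
    ir = instr
    pc = pc + 4

    # Step 2: decode and read the register file into RA and RB.
    if op == "Add":
        _, dst, src1, src2 = ir
        ra, rb = regs[src1], regs[src2]
    else:
        _, reg, offset, base = ir
        ra = regs[base]
        rb = regs[reg] if op == "Store" else None

    # Step 3: compute. MuxB selects RB (Add) or the immediate value (Load/Store).
    rz = ra + (rb if op == "Add" else offset)
    rm = rb  # only meaningful for Store

    # Step 4: memory access, or pass RZ through MuxY into RY.
    ry = None
    if op == "Load":
        ry = memory[rz]
    elif op == "Store":
        memory[rz] = rm
    else:
        ry = rz

    # Step 5: write back to the register file (no action for Store).
    if op == "Add":
        regs[dst] = ry
    elif op == "Load":
        regs[reg] = ry
    return pc


# Hypothetical initial state, chosen only for the demonstration.
regs = {"R3": 0, "R4": 10, "R5": 32, "R6": 7, "R7": 100, "R8": 200}
memory = {120: 55}
pc = 0
pc = execute(("Add", "R3", "R4", "R5"), regs, memory, pc)   # R3 <- 10 + 32
pc = execute(("Load", "R5", 20, "R7"), regs, memory, pc)    # R5 <- memory[120]
pc = execute(("Store", "R6", 20, "R8"), regs, memory, pc)   # memory[220] <- 7
```

After these three calls, R3 holds 42, R5 holds 55, memory location 220 holds 7, and the PC has advanced by 12.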
Instruction Fetch Section
Memory address generation
The addresses used to access the
memory come from two sources.
1. From the PC, and
2. From register RZ in the datapath when accessing instruction operands.
Multiplexer MuxMA selects one
of these two sources to be sent to
the processor-memory interface.
The Instruction address generator
increments the PC after fetching an
instruction.
It also generates branch and
subroutine addresses.
MuxMA selects RZ when
reading/writing data operands.
Processor control section
When an instruction is
read, it is placed in IR.
The control circuitry
decodes the instruction.
It generates the control
signals that drive all units.
The Immediate block
gives the immediate
operand value in 32 bits,
according to the type of
instruction.
Instruction address generator
• Connections to registers RA and
RY are used to support
subroutine call and return
instructions.
• An adder is used to increment the PC by 4 during straight-line execution. It is also used to compute a new value to be loaded into the PC when executing branch and subroutine call instructions.
• One adder input is connected to
the PC. The second input is
connected to a multiplexer,
MuxINC, which selects either
the constant 4 or the branch
offset to be added to the PC.
The branch offset is given in the immediate field of the IR. The output
of the adder is routed to the PC via a second multiplexer, MuxPC, which
selects between the adder and the output of register RA.
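The selection logic of the instruction address generator can be sketched as a pure function. The function name `next_pc` and its boolean select parameters are hypothetical stand-ins for the MuxINC and MuxPC control signals; the addresses in the usage lines are example values only.

```python
def next_pc(pc, branch_offset=0, take_branch=False, ra=None, use_ra=False):
    """Return the value that would be loaded into the PC.

    MuxINC selects the constant 4 or the branch offset as the second adder input.
    MuxPC selects the adder output or the contents of register RA.
    """
    mux_inc = branch_offset if take_branch else 4
    adder_out = pc + mux_inc
    return ra if use_ra else adder_out


straight_line = next_pc(100)                                   # PC <- [PC] + 4
branch_target = next_pc(104, branch_offset=40, take_branch=True)
call_target = next_pc(104, ra=500, use_ra=True)                # PC <- [RA]
```

Here `straight_line` is 104, `branch_target` is 144, and `call_target` is 500.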
Instruction Fetch and Datapath for Subroutine call with indirection: Call_register R9
[Figure: PC ← [PC] + 4; labels Main Memory, R-link, and R9, with the values 104 and 500]
Control signals
• The operation of the processor's hardware components is governed by control signals. These signals are used to:
• Select multiplexer inputs to guide the flow of data.
• Set the function performed by the ALU.
• Determine when data are written into the PC, the IR, the register file, and the memory.
• In each clock cycle, the results of the actions that take place in one stage are stored in inter-stage registers, to be available for use by the next stage in the next clock cycle. Since data are transferred from one stage to the next in every clock cycle, inter-stage registers are always enabled. This is the case for registers RA, RB, RZ, RY, RM, and PC-Temp. The contents of the other registers, namely, the PC, the IR, and the register file, must not be changed in every clock cycle.
Register file control signals
The register file has three 5-bit address inputs, allowing access to 32 general-purpose registers. Two of these inputs, Address A and Address B, determine which registers are to be read. They are connected to bits 31-27 and bits 26-22 of the instruction register (IR).
The third address input, Address C, selects the destination register, into which the input data at port C are to be written.
A multiplexer selects the source of Address C. Its third input is the address of the link register used in subroutine linkage instructions.
New data are loaded into the selected register only when the control
signal RF_write is asserted.
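A rough model of the register file interface is sketched below. The class and function names are hypothetical; only the 5-bit addresses, the IR bit fields 31-27 and 26-22, and the RF_write signal come from the text above.

```python
class RegisterFile:
    """Illustrative model: two read ports (A, B) and one write port (C)."""

    def __init__(self):
        self.regs = [0] * 32  # a 5-bit address selects one of 32 registers

    def read(self, addr_a, addr_b):
        # Both read ports are always active.
        return self.regs[addr_a & 0x1F], self.regs[addr_b & 0x1F]

    def write(self, addr_c, data_c, rf_write):
        # New data are loaded only when RF_write is asserted.
        if rf_write:
            self.regs[addr_c & 0x1F] = data_c


def source_fields(ir):
    """Extract IR bits 31-27 and 26-22 from a 32-bit instruction word."""
    return (ir >> 27) & 0x1F, (ir >> 22) & 0x1F


rf = RegisterFile()
rf.write(3, 99, rf_write=False)   # ignored, RF_write not asserted
before = rf.read(3, 0)
rf.write(3, 99, rf_write=True)
after = rf.read(3, 0)
fields = source_fields((4 << 27) | (5 << 22))
```

In this run `before` is (0, 0), `after` is (99, 0), and `fields` is (4, 5).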
ALU control signals
The operation performed by the ALU is determined by a k-bit control code, ALU_op, which can specify up to 2^k distinct operations, such as Add, Subtract, AND, OR, and XOR.
When an instruction calls for two values to be compared, a comparator performs the comparison specified. The comparator generates condition signals that indicate the result of the comparison. These signals are examined by the control circuitry during the execution of conditional branch instructions to determine whether the branch condition is true or false.
Z (zero): Set to 1 if the result is 0.
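As an illustration of a k-bit control code selecting the ALU function, the sketch below uses k = 3. The specific code assignments are hypothetical (the slides do not give an encoding), and the 32-bit result width is an assumption.

```python
K = 3  # assumed width of ALU_op; allows up to 2**K = 8 operations

# Hypothetical encoding of ALU_op values.
ALU_OPS = {
    0b000: lambda a, b: a + b,   # Add
    0b001: lambda a, b: a - b,   # Subtract
    0b010: lambda a, b: a & b,   # AND
    0b011: lambda a, b: a | b,   # OR
    0b100: lambda a, b: a ^ b,   # XOR
}


def alu(alu_op, a, b):
    """Return (result, Z), where Z is 1 if the 32-bit result is 0."""
    result = ALU_OPS[alu_op](a, b) & 0xFFFFFFFF
    z = 1 if result == 0 else 0
    return result, z


sum_result = alu(0b000, 2, 3)    # (5, 0)
equal_check = alu(0b001, 7, 7)   # (0, 1): Z is set because 7 - 7 = 0
```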
(2) & (3) Two source register identifiers: As there are 64 registers in total, each can be identified by 6 bits. As there are two of them, they need 6 bits + 6 bits.
As there are 100 instructions, the program has a size of 425 bytes, which can be stored in the 500-byte memory from the given options.
1 -- -- -- -- --
2 -- -- -- -- --
3 85320 4200 -- -- --
[Figure: timeline from 6 PM to midnight for loads A to D; each load takes 30, 40, and 20 minutes in its three stages]
• Sequential laundry takes 6 hours for 4 loads
Traditional Pipeline Concept
[Figure: timeline from 6 PM to midnight with loads A, B, C, and D in task order; segment times 30, 40, 40, 40, 40, 20 minutes]
• Pipelined laundry takes 3.5 hours for 4 loads
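The 6-hour and 3.5-hour figures can be reproduced with a short calculation. This sketch assumes three stages of 30, 40, and 20 minutes and four loads, as read from the figure labels, and that each stage handles one load at a time; the function names are illustrative.

```python
def sequential_minutes(stage_minutes, loads):
    """Each load finishes all stages before the next load starts."""
    return sum(stage_minutes) * loads


def pipelined_minutes(stage_minutes, loads):
    """Each stage is a single resource; a load enters a stage as soon as
    both the load and the stage are free."""
    stage_free = [0] * len(stage_minutes)  # time at which each stage is free
    finish = 0
    for _ in range(loads):
        t = 0  # time at which this load is ready for its next stage
        for s, minutes in enumerate(stage_minutes):
            start = max(t, stage_free[s])
            t = start + minutes
            stage_free[s] = t
        finish = t
    return finish


stages = [30, 40, 20]
seq_hours = sequential_minutes(stages, 4) / 60    # 360 min = 6.0 hours
pipe_hours = pipelined_minutes(stages, 4) / 60    # 210 min = 3.5 hours
```

The pipelined total is 30 + 4 × 40 + 20 minutes: the 40-minute stage is the bottleneck that sets the rate.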
Traditional Pipeline Concept
• Pipelining doesn't help latency of single task, it helps throughput of entire workload
• Pipeline rate limited by slowest pipeline stage
• Multiple tasks operating simultaneously using different resources
• Potential speedup = Number pipe stages
• Unbalanced lengths of pipe stages reduces speedup
• Time to "fill" pipeline
• Stall(stop making
[Figure: timeline from 6 PM to 9 PM with loads A, B, C, and D in task order; segment times 30, 40, 40, 40, 40, 20 minutes]
Basic Concept of Pipelining
• Circuit technology and hardware arrangement influence the speed of
execution for programs
• All computer units benefit from faster circuits
• Pipelining involves arranging the hardware to perform multiple
operations simultaneously
• Same total time for each item, but overlapped
• Focus on pipelining of instruction execution; the multistage datapath consists of: Fetch, Decode, Compute, Memory, Write
• Without pipelining, instructions are fetched & executed one at a time with only one stage active in any cycle
• With pipelining, multiple stages are active simultaneously for different instructions
• Each instruction still takes 5 cycles to execute, but the completion rate is 1 instruction per cycle
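The effect on total execution time can be sketched with cycle counts. This assumes an ideal five-stage pipeline with no stalls; the function names are illustrative.

```python
def cycles_unpipelined(n, stages=5):
    """One instruction at a time: every instruction takes all stages."""
    return n * stages


def cycles_pipelined(n, stages=5):
    """The first instruction takes `stages` cycles to complete; after that,
    one instruction completes per cycle."""
    return stages + (n - 1)


unpipelined = cycles_unpipelined(100)    # 500 cycles
pipelined = cycles_pipelined(100)        # 104 cycles
speedup = unpipelined / pipelined        # approaches 5 as n grows
```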
Pipeline Organization
• Use the program counter (PC) to fetch instructions; a new instruction enters the pipeline every cycle
• Carry along instruction-specific information as instructions flow
through the different stages. Use interstage buffers to hold this
information. These buffers incorporate RA, RB, RM, RY, RZ, IR, and
PC-Temp registers
The interstage buffers are used as follows
• Interstage buffer B1 feeds the Decode stage with a newly-fetched
instruction.
• Interstage buffer B2 feeds the Compute stage with the two operands read
from the register file, the source/destination register identifiers, the
immediate value derived from the instruction, the incremented PC value
used as the return address for a subroutine call, and the settings of control
signals determined by the instruction decoder. The settings for control
signals move through the pipeline to determine the ALU operation, the
•Interstage buffer B3 holds the result of the ALU operation, which may
be data to be written into the register file or an address that feeds the
Memory stage. In the case of a write access to memory, buffer B3 holds
the data to be written. These data were read from the register file in the
Decode stage. The buffer also holds the incremented PC value passed
from the previous stage, in case it is needed as the return address for a
subroutine-call instruction.
• Interstage buffer B4 feeds the Write stage with a value to be written
into the register file. This value may be the ALU result from the
Compute stage, the result of the Memory access stage, or the
incremented PC value that is used as the return address for a subroutine-
call instruction.
A five-stage pipeline
[Figure: a five-stage pipeline with interstage buffers. Labels on the buffers: Instruction; control signals, two operands, incremented PC value, destination register; result of ALU, incremented PC value, address that feeds the Memory stage, data to be written.]
Example: ADD R1,R2,R3
• 1st clock: the buffer holds the instruction ADD R1,R2,R3.
• 2nd clock: the buffer holds the control signals, the two operands R2 and R3, the incremented PC value, and R1.
• At the end, R1 is updated in the register file.
Example: ADD R1,R2,#50
• 1st clock: the buffer holds the instruction ADD R1,R2,#50.
• 2nd clock: the buffer holds the control signals, the operand R2, the immediate value 50, the incremented PC value, and R1.
• At the end, R1 is updated in the register file.
Instruction sequence shown in the figure:
I1: ADD R1,R2,R3
I2: Load R4,R5,R6
I3: Store R7, A
I4: Mul R8,R9,R10
I5: Sub R11,R12,R13
Clock   Fetch   Decode   Compute   Memory   Write
1st     I1
2nd     I2      I1
3rd     I3      I2       I1
4th     I4      I3       I2        I1
5th     I5      I4       I3        I2       I1 (R1 is updated)
Hazard and types of Hazard
Any condition that causes a pipeline to stall is called a hazard.
1. Data hazard: any condition in which either the source or the destination operands of an instruction are not available at the time expected in the pipeline. So some operation has to be delayed, and the pipeline stalls.
1. Data dependency
2. Memory Delays with Cache Miss and Cache Hit
3. Branching
Data dependency
• Consider two successive instructions Ij and Ij+1. [ex1]
A ← 3 + A ; Ij : Add R1,#3 ; R1 ← R1 + 3
B ← 4 * A ; Ij+1 : Mul R1,#4 ; R1 ← 4 * R1
• Assume that the destination register of Ij matches one of the source registers of Ij+1.
• The result of Ij is written to the destination in cycle 5, but Ij+1 reads the old value of the register in cycle 3. Due to pipelining, the computation of Ij+1 is incorrect. So stall (delay) Ij+1 until Ij writes the new value. The condition requiring this stall is a data hazard.
Data Dependencies
• Data dependencies lead to data hazards
• Now consider the specific instructions
Add R2, R3, #100
Subtract R9, R2, #30
• Destination R2 of Add is a source for Subtract
• There is a data dependency between them [the destination identifier of the previous instruction is one of the source identifiers of the next instruction] because R2 carries data from Add to Subtract
• On a non-pipelined datapath, the result is available in R2 because Add completes before Subtract.
• There are two solutions for data dependencies
1. Stalling
2. Data Forwarding
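The dependency condition (the destination of one instruction is a source of the next) can be expressed as a simple check. The `Instr` record and the function name are hypothetical, introduced only for this sketch.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    op: str
    dest: str
    sources: Tuple[str, ...]  # source register identifiers only


def has_data_dependency(prev, nxt):
    """True if the destination of prev is one of the sources of nxt."""
    return prev.dest in nxt.sources


add = Instr("Add", "R2", ("R3",))        # Add R2, R3, #100
sub = Instr("Subtract", "R9", ("R2",))   # Subtract R9, R2, #30

dependent = has_data_dependency(add, sub)    # True: R2 carries data
reverse = has_data_dependency(sub, add)      # False: R9 is not read by Add
```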
Stalling the Pipeline
• With pipelined execution, the old value is still in register R2 when Subtract is in the Decode stage
• So stall Subtract for 3 cycles in the Decode stage.
• The new value of R2 is then available in cycle 6
• The idle time from each NOP is called a bubble [here, 3 bubbles]
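The count of 3 stall cycles can be derived from the stage positions. This sketch assumes a five-stage pipeline without forwarding, in which an instruction fetched in cycle f would normally be in Decode in cycle f + 1 and in Write in cycle f + 4, and in which a newly written register value can be read in the cycle after the write. The function name is illustrative.

```python
def stall_cycles(producer_fetch, consumer_fetch):
    """Number of cycles the consumer must be held in Decode (no forwarding)."""
    write_cycle = producer_fetch + 4        # Write is the 5th stage
    normal_decode = consumer_fetch + 1      # Decode is the 2nd stage
    first_valid_decode = write_cycle + 1    # new value readable after the write
    return max(0, first_valid_decode - normal_decode)


# Add fetched in cycle 1, Subtract fetched in cycle 2.
bubbles = stall_cycles(1, 2)           # 3
independent_gap = stall_cycles(1, 5)   # 0: far enough apart, no stall needed
```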
ADD R2,R1,R0 IF ID EX WB
Here, if I2 has not yet entered the Decode stage when I3 enters the Fetch stage, the contents of I2 in interstage buffer B1 [F/D buffer] are erased.
I2 IF RD ---- EX MA WB
----
I3 IF RD EX MA WB
T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11
I1 IF RD EX MA WB
I2 IF ---- ---- RD EX MA WB
I3 IF ---- ---- RD EX MA WB
G10. Consider an instruction pipeline with four stages (S1, S2, S3 and S4), each with combinational circuit only. The pipeline registers are required between each stage and at the end of the last stage. Delays for the stages and for the pipeline registers are as given in the figure. What is the speedup of the pipeline in steady state under ideal conditions when compared to the corresponding non-pipelined implementation?
Branch Penalty
Cycle time = max of all stage delays + buffer delay = max (5 ns, 7 ns, 10 ns, 8 ns, 6 ns) + 1 ns = 10 ns + 1 ns = 11 ns
Of the instructions I1, I2, I3, ..., I12, only I4 is a branch instruction, and when I4 takes the branch, control jumps to the target instruction I9. In the timing diagram there is a gap of only 3 stall cycles between I4 and I9, because once I4 enters Decode Instruction (DI), whether the branch is taken is known only at the end of the Execute Instruction (EI) phase. So a total of 3 phases, namely DI, FO, and EI, are stalls. After the 3 stall cycles, I9 starts executing as the branch target.
Total no. of clock cycles to complete the program = 15
Since 1 clock cycle = 11 ns, time to complete the program = 15 × 11 = 165 ns
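The cycle count of 15 can be checked with a short calculation. This sketch assumes a five-stage pipeline and that the executed instructions are I1 to I4 followed by I9 to I12, as the explanation above implies; the count uses the usual ideal-pipeline formula (stages + instructions - 1) plus the stall cycles. Variable names are illustrative.

```python
stage_delays_ns = [5, 7, 10, 8, 6]
buffer_delay_ns = 1

# The slowest stage plus the buffer delay sets the clock period.
cycle_ns = max(stage_delays_ns) + buffer_delay_ns

executed = [1, 2, 3, 4, 9, 10, 11, 12]   # I5 to I8 are skipped by the branch
stall_count = 3                           # branch penalty between I4 and I9

cycles = len(stage_delays_ns) + (len(executed) - 1) + stall_count
total_ns = cycles * cycle_ns
```

This gives `cycle_ns` = 11, `cycles` = 15, and `total_ns` = 165, matching the values above.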
G14. Instruction execution in a processor is divided into 5 stages,
Instruction Fetch (IF), Instruction Decode (ID), Operand Fetch (OF),
Execute (EX), and Write Back (WB). These stages take 5, 4, 20, 10 and 3
nanoseconds (ns) respectively. A pipelined implementation of the
processor requires buffering between each pair of consecutive stages
with a delay of 2 ns. Two pipelined implementations of the processor are
contemplated:
(i) a naive pipeline implementation (NP) with 5 stages and
(ii) an efficient pipeline (EP) where the OF stage is divided into stages
OF1 and OF2 with execution times of 12 ns and 8 ns respectively.
Find the speedup (correct to two decimal places) achieved by EP over NP
in executing 20 independent instructions with no hazards.
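One way to work this problem is sketched below. It assumes the clock period of each pipeline is its slowest stage plus the 2 ns buffer delay, and that n instructions take (stages + n - 1) cycles when there are no hazards. The function name is illustrative.

```python
def total_time_ns(stage_ns, buffer_ns, n):
    """Time for n independent instructions on an ideal pipeline."""
    cycle = max(stage_ns) + buffer_ns
    return (len(stage_ns) + n - 1) * cycle


# NP: IF, ID, OF, EX, WB
np_time = total_time_ns([5, 4, 20, 10, 3], 2, 20)      # 24 cycles x 22 ns
# EP: IF, ID, OF1, OF2, EX, WB
ep_time = total_time_ns([5, 4, 12, 8, 10, 3], 2, 20)   # 25 cycles x 14 ns

speedup = round(np_time / ep_time, 2)
```

Under these assumptions NP takes 528 ns, EP takes 350 ns, and the speedup of EP over NP is 1.51.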
There are data dependencies involving registers R4 and R7. The Subtract
instruction needs the new values for these registers before they are
written to the register file. Hence, those values need to be forwarded to
the ALU inputs when the Subtract instruction is in the Compute stage of
the pipeline.
As for the contents of registers RY and RZ during cycles 4 to 7: using the initial values for registers R2 and R3, the Add instruction generates the result of 12 in cycle 3. That result is available in register RZ during cycle 4.
In cycle 4, the Or instruction generates the result of 130. That result is
placed in register RZ to be available during cycle 5. The result of 12 for
the Add instruction is in register RY during cycle 5.
Ex2: Consider the following instructions at the given addresses in the
memory:
1000 Add R3, R2, #20
1004 Sub R5, R4, #3
1008 And R6, R3, #0x3A
1012 Add R7, R2, R4
Initially, registers R2 and R4 contain 2000 and 50, respectively. These instructions are executed in a computer that has a five-stage pipeline. The first instruction is fetched in clock cycle 1, and the remaining instructions are fetched in successive cycles.
(a) Draw a diagram that represents the flow of the instructions through
the pipeline. Describe the operation being performed by each pipeline
stage during clock cycles 1 through 8.
(b) Also, describe the contents of registers IR, PC, RA, RB, RY, and RZ
in the pipeline during cycles 4 to 7.
Clock Cycles       1  2  3  4  5  6  7  8
Add R3,R2,#20      F  D  C  M  W
Sub R5,R4,#3          F  D  C  M  W
And R6,R3,#0x3A          F  D  C  M  W
Add R7,R2,R4                F  D  C  M  W
Cycles        1     2     3     4     5     6     7     8
IR (Decode)   --    Add   Sub   And   Add   --    --    --
PC            1000  1004  1008  1012  1016  1020  1024  1028
RA            --    --    2000  50    [R3]  2000  --    --
RB            --    --    --    --    --    50    --    --
RZ            --    --    --    2020  47    32    2050  --
RY            --    --    --    --    2020  47    32    2050
The contents of RZ in cycle 6 and RY in cycle 7 are determined as
(2020 AND 0x3A) = (0x7E4 AND 0x3A) = 0x20 = 32
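The four results that pass through RZ can be checked directly. This sketch assumes the And instruction uses the new value of R3 (2020) produced by the first Add; the variable names are illustrative.

```python
r2, r4 = 2000, 50            # initial register contents from the problem

add_r3 = r2 + 20             # Add R3, R2, #20
sub_r5 = r4 - 3              # Sub R5, R4, #3
and_r6 = add_r3 & 0x3A       # And R6, R3, #0x3A, using the new R3
add_r7 = r2 + r4             # Add R7, R2, R4

rz_by_cycle = {4: add_r3, 5: sub_r5, 6: and_r6, 7: add_r7}
```

The values are 2020, 47, 32, and 2050 for cycles 4 to 7; `hex(2020)` is `0x7e4`, which confirms the hexadecimal step above.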