Lecture # 7.

The Processor
Lecture # 7
Course Instructor: Dr. Noshina
Outline
• Single cycle control
• ALU control
• Single cycle design problems
• An overview of pipelining
• Pipelining analogy
• RISC-V Pipeline
• Pipeline performance example
• Pipeline speedup
Single cycle Datapath design
Control
• Control unit takes input from
– the instruction opcode bits
• Control unit generates
– ALU control input
– write enable (possibly, read enable also)
signals for each storage element
– selector controls for each multiplexor
ALU Control
• Depending on the instruction class, the ALU will need
to perform one of four functions: AND, OR, add, or subtract.
– add for loads/stores
– sub for branch on equal
– one of and, or, add, sub for R-type instructions.
• We can generate the 4-bit ALU control input using a
small control unit that has as inputs the funct7 and
funct3 fields of the instruction and a 2-bit control field,
which we call ALUOp.
ALUOp
• ALUOp indicates whether the operation to be
performed should be
– add (00) for loads and stores
– subtract and test if zero (01) for beq, or
– determined by the operation encoded in
the funct7 and funct3 fields (10).
• The output of the ALU control unit is a 4-bit
signal that directly controls the ALU by
generating one of the 4-bit combinations.
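The ALU control decoding described above can be sketched in a few lines. This is only an illustration: the 4-bit codes (AND = 0000, OR = 0001, add = 0010, subtract = 0110) and the funct7/funct3 values are assumed from the usual textbook encoding, since the table from the slide is not reproduced here, and the function name alu_control is hypothetical.

```python
# Assumed 4-bit ALU control codes (standard textbook encoding, not given in the slides).
ALU_AND, ALU_OR, ALU_ADD, ALU_SUB = 0b0000, 0b0001, 0b0010, 0b0110

# (funct7, funct3) pairs for the four R-type operations considered here.
R_TYPE_FUNCT = {
    (0b0000000, 0b000): ALU_ADD,
    (0b0100000, 0b000): ALU_SUB,
    (0b0000000, 0b111): ALU_AND,
    (0b0000000, 0b110): ALU_OR,
}


def alu_control(alu_op: int, funct7: int = 0, funct3: int = 0) -> int:
    """Return the 4-bit ALU control value for a 2-bit ALUOp and the funct fields."""
    if alu_op == 0b00:   # loads and stores: add base register and offset
        return ALU_ADD
    if alu_op == 0b01:   # beq: subtract and test if zero
        return ALU_SUB
    if alu_op == 0b10:   # R-type: operation comes from funct7 and funct3
        code = R_TYPE_FUNCT.get((funct7, funct3))
        if code is None:
            raise ValueError("unsupported funct7/funct3 combination")
        return code
    raise ValueError("unsupported ALUOp value")


print(format(alu_control(0b10, 0b0100000, 0b000), "04b"))  # R-type sub
```

Note that for ALUOp 00 and 01 the funct fields are ignored, which is why they default to 0 in this sketch.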
ALUOp
Instruction classes
Control Signals
Datapath with Control I
New multiplexor
Control Lines
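As a rough sketch of how the control lines could be derived from the opcode, the table below covers R-format, ld, sd and beq. The signal names (ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch) and their values are assumptions taken from the usual textbook single-cycle datapath, not from these slides; None marks a don't-care value.

```python
from typing import NamedTuple, Optional


class Control(NamedTuple):
    alu_src: int               # 0: second ALU operand from register, 1: from immediate
    mem_to_reg: Optional[int]  # 1: write back memory data, 0: ALU result, None: don't care
    reg_write: int             # write enable for the register file
    mem_read: int              # read enable for data memory
    mem_write: int             # write enable for data memory
    branch: int                # instruction is a conditional branch
    alu_op: int                # 2-bit ALUOp sent to the ALU control unit


# Keyed by the 7-bit opcode (bits 6-0 of the instruction).
CONTROL_TABLE = {
    0b0110011: Control(0, 0, 1, 0, 0, 0, 0b10),     # R-format
    0b0000011: Control(1, 1, 1, 1, 0, 0, 0b00),     # ld
    0b0100011: Control(1, None, 0, 0, 1, 0, 0b00),  # sd
    0b1100011: Control(0, None, 0, 0, 0, 1, 0b01),  # beq
}


def main_control(instruction: int) -> Control:
    """Look up the control lines for a 32-bit instruction word."""
    opcode = instruction & 0x7F
    if opcode not in CONTROL_TABLE:
        raise ValueError("opcode not covered by this sketch")
    return CONTROL_TABLE[opcode]


print(main_control(0b0000011))  # control lines for a load
```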
Single-cycle Implementation Notes
• The steps are not really distinct as each instruction
completes in exactly one clock cycle – they simply
indicate the sequence of data flowing through the
datapath.
• The operation of the datapath during a cycle is purely
combinational – nothing is stored during a clock cycle.
• Therefore, the machine is stable in a particular state at
the start of a cycle and reaches a new stable state only
at the end of the cycle.
Load Instruction Steps: ld x9, offset(x19)
1. Fetch instruction and increment PC
2. Read base register from the register file: the base register (x19) is
given by bits 19-15 of the instruction
3. ALU computes the sum of the value read from the register file and the
sign-extended upper 12 bits (offset) of the instruction.
4. The sum from the ALU is used as the address for the data
memory
5. The data from the memory unit is written into the register file:
the destination register (x9) is given by bits 11-7 of the
instruction
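The five load steps can be traced in a small simulation. This is a sketch under assumptions: the register file is a plain list, data memory is a dictionary, and the names execute_load and LD_X9_8_X19 as well as the register and memory contents are made up for illustration.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)


def execute_load(instruction: int, pc: int, regs: list, data_memory: dict) -> int:
    """Carry out ld rd, offset(rs1) and return the next PC."""
    next_pc = pc + 4                                      # step 1: increment PC
    rs1 = (instruction >> 15) & 0x1F                      # step 2: base register, bits 19-15
    base = regs[rs1]
    offset = sign_extend((instruction >> 20) & 0xFFF, 12) # step 3: upper 12 bits, sign-extended
    address = base + offset
    value = data_memory[address]                          # step 4: sum is the memory address
    rd = (instruction >> 7) & 0x1F                        # step 5: destination, bits 11-7
    regs[rd] = value
    return next_pc


# ld x9, 8(x19): imm=8, rs1=19, funct3=011, rd=9, opcode=0000011
LD_X9_8_X19 = (8 << 20) | (19 << 15) | (0b011 << 12) | (9 << 7) | 0b0000011

regs = [0] * 32
regs[19] = 1000                  # hypothetical base address
memory = {1008: 42}              # hypothetical memory contents
next_pc = execute_load(LD_X9_8_X19, 0, regs, memory)
print(next_pc, regs[9])
```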
Branch Instruction Steps: beq x5, x6, offset
1. Fetch instruction and increment PC
2. Read two registers (x5 and x6) from the register file.
3. ALU performs a subtract on the data values from
the register file; the value of the PC is added to the
sign-extended 12-bit immediate of the instruction, shifted
left by one, to give the branch target address
4. The Zero result from the ALU is used to decide
which adder result (from step 1 or 3) to store in the
PC
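A minimal sketch of the branch decision follows. It assumes the 12-bit immediate has already been gathered from the instruction word (its bit positions in the encoding are not modeled), and that the target is computed relative to the PC of the branch itself, as in RISC-V. The name execute_beq is hypothetical.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)


def execute_beq(pc: int, imm12: int, rs1_value: int, rs2_value: int) -> int:
    """Return the next PC for beq rs1, rs2, offset."""
    pc_plus_4 = pc + 4                            # step 1: sequential address
    difference = rs1_value - rs2_value            # step 3: ALU subtract
    zero = difference == 0                        # Zero output of the ALU
    target = pc + (sign_extend(imm12, 12) << 1)   # step 3: branch target adder
    return target if zero else pc_plus_4          # step 4: select the new PC


print(execute_beq(100, 8, 5, 5))   # equal: branch taken
print(execute_beq(100, 8, 5, 6))   # not equal: fall through
```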
Single-Cycle Design Problems
• With a fixed-period clock, every instruction uses exactly one clock
cycle, which implies:
– CPI = 1
– cycle time is determined by the length of the longest instruction path (load)
• but several instructions could run in a shorter clock cycle: a waste
of time
• consider what happens with more complicated instructions like floating
point!
– resources used more than once in the same cycle need to be
duplicated
• waste of hardware and chip area
An overview of pipelining
• Pipelining is an implementation technique
in which multiple instructions are overlapped
in execution.
• All steps in a task, called stages in
pipelining, operate concurrently.
• If we have separate resources for each
stage, we can pipeline the tasks.
• Pipelining improves performance by
increasing instruction throughput, as
opposed to decreasing the execution time of
an individual instruction.
• If all the stages take about the same
amount of time and there is enough work
to do, then the speed-up due to pipelining
is equal to the number of stages in the
pipeline.
Pipelining Analogy
• Pipelined laundry: overlapping execution
– Parallelism improves performance
RISC-V Pipeline
Five stages, one step per stage

1. IF: Instruction fetch from memory
2. ID: Instruction decode & register read
3. EX: Execute operation or calculate address
4. MEM: Access memory operand
5. WB: Write result back to register
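To visualise how the five stages overlap, the sketch below prints which stage each instruction occupies in each clock cycle. It assumes an ideal pipeline with no hazards or stalls, and the function name pipeline_schedule is hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_schedule(num_instructions: int) -> list:
    """Return rows[i][c]: the stage of instruction i in cycle c, or '' if idle."""
    total_cycles = num_instructions + len(STAGES) - 1
    rows = []
    for i in range(num_instructions):
        row = [""] * total_cycles
        for s, name in enumerate(STAGES):
            row[i + s] = name      # instruction i enters stage s in cycle i + s
        rows.append(row)
    return rows


for i, row in enumerate(pipeline_schedule(3)):
    print(f"instr {i}: " + " ".join(f"{stage:>3}" for stage in row))
```

Each instruction starts one cycle after the previous one, so in the steady state all five stages are busy with different instructions.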
Pipeline performance example
• Contrast the average time between instructions of
a single-cycle implementation, in which all
instructions take one clock cycle, to a pipelined
implementation. Assume that the operation times
for the major functional units in this example are
200 ps for memory access for instructions or
data, 200 ps for ALU operation, and 100 ps for
register file read or write. In the single-cycle
model, every instruction takes exactly one clock
cycle, so the clock cycle must be stretched to
accommodate the slowest instruction.
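The clock periods for this example can be worked out from the unit times given above. The mapping of instruction classes to the units they use is an assumption based on the usual textbook breakdown (for instance, sd does not write a register and beq does not access data memory).

```python
# Unit times in picoseconds, from the example.
UNIT_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

# Units used by each instruction class (assumed, standard breakdown).
USES = {
    "ld": ["IF", "ID", "EX", "MEM", "WB"],
    "sd": ["IF", "ID", "EX", "MEM"],
    "R-format": ["IF", "ID", "EX", "WB"],
    "beq": ["IF", "ID", "EX"],
}


def instruction_time(instruction_class: str) -> int:
    """Total time in ps for one instruction of the given class."""
    return sum(UNIT_PS[unit] for unit in USES[instruction_class])


# Single-cycle: the clock must fit the slowest instruction.
single_cycle_tc = max(instruction_time(c) for c in USES)
# Pipelined: the clock must fit the slowest stage.
pipelined_tc = max(UNIT_PS.values())

for c in USES:
    print(c, instruction_time(c), "ps")
print("single-cycle Tc:", single_cycle_tc, "ps")
print("pipelined Tc:", pipelined_tc, "ps")
```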
Single-cycle (Tc = 800 ps)
Pipelined (Tc = 200 ps)
Pipeline Speedup
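• If all stages are balanced
– i.e., all take the same time
– time between instructions (pipelined) = time between instructions (non-pipelined) / number of stages
• If not balanced, speedup is less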

• Speedup due to increased throughput
– Latency (time for each instruction) does not decrease
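A quick calculation shows why the speedup here approaches 4 rather than 5. The sketch uses the clock periods from the example (800 ps single-cycle, 200 ps pipelined) and assumes an ideal five-stage pipeline with no stalls; the function names are hypothetical.

```python
def single_cycle_time(n: int, tc_ps: int = 800) -> int:
    """Total time in ps for n instructions, one per clock cycle."""
    return n * tc_ps


def pipelined_time(n: int, tc_ps: int = 200, stages: int = 5) -> int:
    """Total time in ps: the first instruction takes `stages` cycles,
    then one more instruction completes every cycle."""
    return (n + stages - 1) * tc_ps


def speedup(n: int) -> float:
    return single_cycle_time(n) / pipelined_time(n)


for n in (3, 1_000_003):
    print(n, single_cycle_time(n), pipelined_time(n), round(speedup(n), 3))
```

For a large number of instructions the ratio approaches 800 / 200 = 4. The stages are not balanced (register read and write take only 100 ps but still occupy a 200 ps cycle), so the speedup stays below the number of stages. Each individual instruction now takes 5 × 200 = 1000 ps, so latency does not improve.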
