Homework3 Solution v2
HW 3: Computer Organization and Design Spring 2023
Solution:
IF/ID Register:
• Instruction = 32 bits
ID/EX Register:
The ID/EX register needs a total of 282 bits. The breakdown is as follows:
• input for ALU Control (assuming that the ALU control exists) = 4 bits
EX/MEM Register:
The EX/MEM register needs a total of 137 bits. The breakdown is as follows:
MEM/WB Register:
The MEM/WB register needs a total of 135 bits. The breakdown is as follows:
• WB control line = 2 bits
Also, assume that instructions executed by the processor are broken down as follows:
(i) What is the clock cycle time in a pipelined and non-pipelined processor?
Solution:
(ii) What is the total latency of an ld instruction in a pipelined and non-pipelined processor?
Solution:
(iii) If we can split one stage of the pipelined datapath into two new stages, each with half the
latency of the original stage, which stage would you split and what is the new clock cycle
time of the processor?
Solution:
We can split the MEM stage, since it has the highest latency. This will reduce
the clock cycle time to 300 ps.
(iv) Assuming there are no stalls or hazards, what is the utilization of the data memory?
Solution:
Data memory is used by ld and sd instructions. Therefore, the utilization of the
data memory is:
20% + 25% = 45%
(v) Assuming there are no stalls or hazards, what is the utilization of the write-register port
of the “Registers” unit?
Solution:
The write-register port of the “Registers” unit is used by ld and R-type instructions.
Therefore, the utilization is:
40% + 20% = 60%
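Both utilization figures follow directly from the instruction mix. A minimal Python sketch, assuming the mix implied by the solutions in this document (R-type 40%, ld 20%, sd 25%); the remaining 15% is labelled `other` here because its class name does not appear in this text:

```python
# Instruction mix in percent. The "other" bucket is an assumption:
# the solutions only name R-type, ld and sd explicitly.
MIX = {"r_type": 40, "ld": 20, "sd": 25, "other": 15}

def utilization(mix, users):
    """Percentage of cycles in which a unit is used, assuming no stalls."""
    return sum(mix[name] for name in users)

data_memory = utilization(MIX, ["ld", "sd"])      # ld reads memory, sd writes it
write_port = utilization(MIX, ["r_type", "ld"])   # instructions that write a register

print(data_memory, write_port)
```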
(i) How many cycles are required to implement/execute one instruction on this pipeline?
Solution:
Since each stage takes 1 cycle and there are 9 stages, a total of 9 cycles are needed
to execute one complete instruction.
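The same counting argument extends to a sequence of instructions: the first instruction needs one cycle per stage, and each later instruction finishes one cycle after its predecessor. A small sketch of this standard count (the function name and the `stall_cycles` parameter are illustrative, not part of the assignment):

```python
def pipeline_cycles(stages, instructions, stall_cycles=0):
    """Cycles needed to finish `instructions` on a `stages`-deep pipeline.

    The first instruction takes `stages` cycles; every later instruction
    completes one cycle after the previous one, plus any stall cycles.
    """
    if instructions == 0:
        return 0
    return stages + (instructions - 1) + stall_cycles

print(pipeline_cycles(9, 1))    # a single instruction on the 9-stage pipeline
print(pipeline_cycles(9, 17))   # 17 back-to-back instructions, no stalls
```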
(ii) How many cycles are required to execute 17 instructions on this pipeline? Assume that
no stall cycles occur during the execution of all instructions.
Solution:
(iii) Assume that all necessary bypass circuitry is implemented in this 9 stage pipeline. How
many cycles will the pipeline stall during the execution of below given two instructions?
b. x2 = x5 + x7
Solution:
2 NOPs or stalls would be needed such that when instruction (a) is in stage 8, instruc-
tion (b) is in stage 5. Due to pipelining and bypass circuitry, the value of x5 would
be directly sent to the ALU of instruction (b). Forwarding would be done as follows:
(iv) Assume that no bypass circuitry is implemented in this 9 stage pipeline. How many cycles
will the pipeline stall during the execution of above mentioned two instructions?
Solution:
Assuming that operand fetch is the stage where data is read from the
registers
A total of 6 stalls would be needed so that when instruction (a) writes the value
for x5 back to the register file, instruction (b) can read it without conflict.
However, if writing to registers (register write back) and reading from registers (operand
fetch) can take place in the same cycle, then a total of 5 stalls would be needed.
(v) Assume that all necessary bypass circuitry is implemented in this 9 stage pipeline. How
many cycles will the pipeline stall during the execution of below given two instructions?
a. x5 = x1 + x3
b. x2 = x5 + x7
Solution:
1 NOP or stall would be needed such that when instruction (a) is in stage 7, in-
struction (b) is in stage 5. Because of the bypass circuitry, the value for x5 would
be sent directly to the ALU for instruction (b)’s computation. Forwarding would be
done as follows:
(vi) Assume that no bypass circuitry is implemented in this 9 stage pipeline. How many cycles
will the pipeline stall during the execution of above mentioned two instructions?
Solution:
Assuming that operand fetch is the stage where data is read from the
registers
A total of 6 stalls would be needed so that when instruction (a) writes the value
for x5 back to the register file, instruction (b) can read it without conflict.
However, if writing to registers (register write back) and reading from registers (operand
fetch) can take place in the same cycle, then a total of 5 stalls would be needed.
(vii) How much total time (in ns) is required to execute one entire instruction if the nine stages
were not pipelined?
Solution:
15 ns
A load instruction uses all 9 of the stages so the cycle time of any instruction would
be defined based on the cycle time for a load instruction which is 15ns.
(viii) What is the delay of 1 cycle (in ns) when all the nine stages are pipelined?
Solution:
The delay of one cycle would be 2.3 ns, which is set by the slowest stage, i.e., stage
5/6.
(ix) What is the total time (in ns) required to execute one entire instruction if the nine stages
are pipelined?
Solution:
In the pipelined version, every stage takes one full clock cycle, i.e., 2.3 ns
per stage. The total time needed for one instruction is therefore 20.7 ns
(2.3 × 9).
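The trade-off between parts (vii) and (ix) can be checked numerically. The sketch below uses only the figures quoted in the solutions (9 stages, a 2.3 ns clock, 15 ns unpipelined); the steady-state throughput comparison at the end is the usual textbook reading and assumes no stalls:

```python
STAGES = 9
PIPELINED_CYCLE_NS = 2.3        # set by the slowest stage, part (viii)
NON_PIPELINED_LATENCY_NS = 15   # all nine stages back to back, part (vii)

# Latency of one instruction gets worse: every stage is stretched to 2.3 ns.
pipelined_latency_ns = PIPELINED_CYCLE_NS * STAGES
extra_latency_ns = pipelined_latency_ns - NON_PIPELINED_LATENCY_NS

# Throughput gets better: one instruction completes per clock once the pipe is full.
throughput_gain = NON_PIPELINED_LATENCY_NS / PIPELINED_CYCLE_NS

print(f"pipelined latency: {pipelined_latency_ns:.1f} ns")
print(f"extra latency per instruction: {extra_latency_ns:.1f} ns")
print(f"steady-state throughput gain: {throughput_gain:.2f}x")
```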
(x) Two code sequences are given below. For each sequence, state whether a forwarding
circuit can avoid all the stalls in the code.
ld x1,0(x3)
add x2, x1, x3
Solution:
Assume that perfect branch prediction is used (no stalls due to control hazards), that the
pipeline has full forwarding support, and that branches are resolved in the EX (as opposed to
the ID) stage.
(i) Show a pipeline execution diagram for the first two iterations of this loop.
Solution:
. . . indicates a stall/bubble.
I have numbered the lines of the code and will use this numbering in the pipeline
execution diagram.
1. LOOP: ld x11, 8(x13)
2. ld x10, 0(x13)
3. add x12, x10, x11
4. addi x13, x13, -16
5. bne x12, x0, LOOP
1  IF  ID  EX  ME  | WB
2      IF  ID  EX  | ME  WB
3          IF  ID  | ... EX  ME  WB
4              IF  | ... ID  EX  ME  WB
5                  | ... IF  ID  EX  ME  WB
1                            IF  ID  EX  ME  WB
2                                IF  ID  EX  ME  WB
3                                    IF  ID  ... EX  | ME  WB
4                                        IF  ... ID  | EX  ME  WB
5                                                IF    ID  EX  ME  WB
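The stage timing in the diagram can be reproduced with a small scheduler. This is a sketch under the assumptions stated in the question (five stages, full forwarding, perfect branch prediction), so the only stall it models is a load followed immediately by an instruction that reads the loaded register. The tuple encoding of instructions and the function name are illustrative.

```python
def schedule(instrs):
    """Return (IF, ID, EX) entry cycles for each (op, dest, sources) tuple.

    Assumes an in-order 5-stage pipeline with full forwarding: the only
    data stall is one bubble between a load and an immediate consumer.
    """
    rows = []
    prev = None
    for op, dest, srcs in instrs:
        if prev is None:
            if_c, id_c, ex_c = 1, 2, 3
        else:
            p_op, p_dest, p_id, p_ex = prev
            if_c = p_id          # IF frees up when the predecessor enters ID
            id_c = p_ex          # ID frees up when the predecessor enters EX
            ex_c = p_ex + 1
            if p_op == "ld" and p_dest in srcs:
                ex_c += 1        # load-use hazard: wait one extra cycle in ID
        rows.append((if_c, id_c, ex_c))
        prev = (op, dest, id_c, ex_c)
    return rows

LOOP = [
    ("ld",   "x11", ("x13",)),
    ("ld",   "x10", ("x13",)),
    ("add",  "x12", ("x10", "x11")),
    ("addi", "x13", ("x13",)),
    ("bne",  None,  ("x12", "x0")),
]

rows = schedule(LOOP * 2)   # first two iterations, branch predicted correctly
for n, (if_c, id_c, ex_c) in enumerate(rows):
    print(n % 5 + 1, f"IF@{if_c} ID@{id_c} EX@{ex_c} ME@{ex_c + 1} WB@{ex_c + 2}")
```

MEM and WB always follow EX by one and two cycles in this model, so only the first three stages need to be tracked.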
(ii) Mark pipeline stages that do not perform useful work. How often while the pipeline is full
do we have a cycle in which all five pipeline stages are doing useful work?
Solution:
! indicates a stage that does not do useful work.
1  IF  ID  EX  ME  | WB
2      IF  ID  EX  | ME  WB
3          IF  ID  | ... EX  ME! WB
4              IF  | ... ID  EX  ME! WB
5                  | ... IF  ID  EX  ME! WB!
1                            IF  ID  EX  ME  WB
2                                IF  ID  EX  ME  WB
3                                    IF  ID  ... EX  | ME! WB
4                                        IF  ... ID  | EX  ME! WB
5                                                IF    ID  EX  ME! WB!
In a clock cycle, a pipeline stage is not performing useful work if it is being stalled or
if an instruction at a particular stage is not doing any useful work.
Solution:
An add instruction uses I-Mem, 2 register reads and register write. This requires
a total of 340pJ (120 + 80 + 80 + 60) in a single-cycle design.
The amount of energy needed in a five-stage pipelined design is the same too since the
amount of work done is the same.
(ii) What is the worst-case RISC-V instruction in terms of energy consumption? What is the
energy spent to execute it?
Solution:
An instruction that uses all of the above components would be the worst case RISC-V
instruction. Load instruction uses I-Mem, one read register, one write register and
D-Mem Read. Although data from only one source register is used, the second register
is still read. So the energy consumption by a load instruction becomes 460pJ (120pJ
+ 80pJ + 60pJ + 120pJ + 80pJ). In short, the worst instruction is load which uses
460pJ.
(iii) If energy reduction is paramount, how would you change the pipelined design? What is
the percentage reduction in the energy spent by an ld instruction after this change?
Solution:
We can use a control circuitry that controls the reading of registers. Since load only
needs to read one register, the control circuitry would stop the reading of the other reg-
ister and hence save 80pJ energy. This would result in (80/460 = 17.39%) reduction
in the energy spent by a load instruction.
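The energy figures in parts (i) to (iii) can be tallied per component. A minimal sketch using the per-component values that appear in the solutions (I-Mem 120 pJ, 80 pJ per register read, 60 pJ per register write, D-Mem read 120 pJ); the dictionary keys and function name are illustrative:

```python
# Per-component energy in pJ, as used in the solutions above.
ENERGY_PJ = {"imem": 120, "reg_read": 80, "reg_write": 60, "dmem_read": 120}

def energy(reg_reads, reg_write, dmem_read):
    """Energy of one instruction given which components it activates."""
    total = ENERGY_PJ["imem"] + reg_reads * ENERGY_PJ["reg_read"]
    if reg_write:
        total += ENERGY_PJ["reg_write"]
    if dmem_read:
        total += ENERGY_PJ["dmem_read"]
    return total

add_pj = energy(reg_reads=2, reg_write=True, dmem_read=False)
ld_pj = energy(reg_reads=2, reg_write=True, dmem_read=True)        # both ports are read
ld_gated_pj = energy(reg_reads=1, reg_write=True, dmem_read=True)  # second read suppressed
reduction = 100 * (ld_pj - ld_gated_pj) / ld_pj

print(add_pj, ld_pj, ld_gated_pj, f"{reduction:.2f}%")
```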
(iv) What other instructions can potentially benefit from the change discussed in (iii)?
Solution:
All I-type instructions would benefit from this change in the architecture. Similarly,
energy consumption by jump instructions would also reduce.
(v) How do your changes from (iii) affect the performance of a pipelined CPU?
Solution:
Since the reading of the registers now depends on the control circuit, register
read and control signal generation cannot be done simultaneously. Hence, some additional
time is spent: latency is added and the CPU time increases.
(vi) We can eliminate the MemRead control signal and have the data memory be read in
every cycle, i.e., we can permanently have MemRead=1. Explain why the processor still
functions correctly after this change. If 25% of instructions are loads, what is the effect of
this change on clock frequency and energy consumption?
Solution:
Reading memory in every cycle does not affect the cycle time or clock frequency. The
data being read might or might not be passed through the MUX: for instructions that
do not need it, the MemtoReg MUX simply does not select it, so the processor still
functions correctly. This increases the risk of bad performance due to cache misses.
However, this does affect energy consumption. For the instructions where a memory
read is not needed, an additional 120 pJ is used. If 25% of the instructions are
loads, then 75% of the instructions use an additional 120 pJ.
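Averaged over the whole program, the extra energy can be estimated as follows. This sketch follows the solution's accounting, in which every non-load instruction pays for one unneeded data memory read:

```python
DMEM_READ_PJ = 120
load_fraction = 0.25

# Loads already read data memory; every other instruction now pays
# for a read whose result is discarded.
extra_pj_per_instruction = (1 - load_fraction) * DMEM_READ_PJ

print(extra_pj_per_instruction)
```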
Solution:
(ii) In the non-pipelined version, would a program with the instruction mix presented in Ques-
tion 2 run faster or slower on this new CPU? By how much? (For simplicity, assume every
ld and sd instruction is replaced with a sequence of two instructions.)
Solution:
The program initially took 1250 × n ps, where n is the number of instructions.
When it is compiled for the modified processor, the same program has
(40 + 15 + (20 × 2) + (25 × 2)) / 100 × n = 1.45 × n instructions.
Therefore, the time taken by the new processor is
1100 × 1.45 × n = 1595 × n ps
The program with the instruction mix presented in Question 2 therefore runs slower
on this new CPU. The speedup is
1250 / 1595 ≈ 0.784
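The calculation can be replayed in a few lines. The sketch assumes the cycle times used in the solution (1250 ps before, 1100 ps after) and the instruction mix from Question 2, with the unnamed 15% class again labelled `other`:

```python
MIX = {"r_type": 40, "other": 15, "ld": 20, "sd": 25}   # percent; "other" is an assumed label
OLD_CYCLE_PS, NEW_CYCLE_PS = 1250, 1100

# On the modified CPU every ld/sd is replaced by a sequence of two instructions.
new_count_per_100 = sum(pct * (2 if name in ("ld", "sd") else 1)
                        for name, pct in MIX.items())
expansion = new_count_per_100 / 100

old_time_ps = OLD_CYCLE_PS            # per original instruction
new_time_ps = NEW_CYCLE_PS * expansion
speedup = old_time_ps / new_time_ps   # below 1 means the new CPU is slower

print(expansion, round(new_time_ps), round(speedup, 3))
```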
(iii) What is the primary factor that influences whether a program will run faster or slower on
the new CPU?
Solution:
The time for the program to run depends on the number of ld/sd instructions. Also,
a program whose loads and stores tend to go to only a few different addresses may also
run faster on the modified CPU.
(iv) Do you consider the original CPU a better overall design; or do you consider the new CPU
a better overall design? Why?
Solution:
I consider the original design better since it is faster than the modified one in
the case of a single-cycle processor. Nevertheless, for the pipelined processor,
the modification has no effect on the clock cycle time.
(v) As a result of the change, the MEM and EX stages of the pipelined version of the processor
can be overlapped and the pipeline has only four stages. How will the reduction in pipeline
depth affect the cycle time?
Solution:
There will be no effect on the clock cycle time since we are not making any changes
to the stage with the highest latency.
(vi) How might this change improve the performance of the pipeline?
Solution:
Running the EX and MEM stages in parallel can effectively reduce the number of stalls,
as it eliminates the need for a stall cycle or NOP when the program has a load
instruction followed by an instruction that uses the loaded register value.
(vii) How might this change degrade the performance of the pipeline?
Solution:
As discussed in previous parts we are replacing the ld instruction with addi and ld
and sd instruction with addi and sd. This will increase the number of instructions.
This will consequently degrade the performance of the pipeline.
We want to modify the RISC-V processor to support the above instructions. For parts (b)
and (d) below, you can use a printed version of the figures in the book over which you can draw
your suggested modifications (no need to draw the entire diagram from scratch). For each of
the above instructions, do the following:
(A) Suggest if any of the existing instruction formats is a good choice to encode the new
instruction. If not, then propose a new instruction format.
(B) Modify the datapath and control signals of the single-cycle RISC-V processor (Figure 4.17
of the book) to execute the new instruction using the instruction format suggested in part (a).
Use the minimal amount of additional hardware and clock cycles/ control states. Remember
when adding new instructions, don’t break the operation of the standard ones.
(C) Discuss the effect of the modification in part (b) on the latency of single-cycled non-
pipelined CPU
(D) Discuss if the suggested modification in part (b) should be handled by increasing/decreasing
the number of pipelining stages that were discussed in class. Draw a pipelined version of the
new processor similar to Figure 4.49 of the book.
(E) Discuss if any new types of data hazards are introduced due to the new instruction? If yes,
can they be mitigated through forwarding? Use a multi-cycle pipeline diagram like Figure 4.51
of the book to illustrate the new forwarding paths.
(F) Discuss, based on the above analysis, why the new instruction was not made part of the
RISC-V architecture.
Solution:
The R-type instruction format can be used to encode the Load Word Register
instruction.
(B) Modify the datapath and control signals of the single-cycle RISC-V
processor (Figure 4.17 of the book) to execute the new instruction
using the instruction format suggested in part (A). Use the minimal
amount of additional hardware and clock cycles/control states. Re-
member when adding new instructions, don’t break the operation of
the standard ones.
The basic structure of the processor stays the same. ALU takes the first in-
put from ReadData1 and the second from ReadData2. The computed sum is
then given to the Data Memory as the address of the data to be loaded. This
data is then passed through the MUX to the Write Data input of the Registers block,
where it is written to rd. The values of the control lines would be as
follows:
• Branch = 0
• MemRead = 1
• MemtoReg = 1
• ALUOP = 00
• MemWrite = 0
• ALUSrc = 0
• RegWrite = 1
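One way to see why no new hardware is needed is to compare these settings with those of the ordinary ld instruction. The sketch below assumes the standard single-cycle control values for ld from the textbook's control table; the name `LOAD_WORD_REGISTER` is only an illustrative label for the new instruction.

```python
# Control settings for the standard ld (assumed from the textbook control table).
LD = {"Branch": 0, "MemRead": 1, "MemtoReg": 1, "ALUOP": "00",
      "MemWrite": 0, "ALUSrc": 1, "RegWrite": 1}

# Control settings for the new register-indexed load, copied from the list above.
LOAD_WORD_REGISTER = {"Branch": 0, "MemRead": 1, "MemtoReg": 1, "ALUOP": "00",
                      "MemWrite": 0, "ALUSrc": 0, "RegWrite": 1}

# Only the ALU's second operand source changes: register instead of immediate.
differing = sorted(k for k in LD if LD[k] != LOAD_WORD_REGISTER[k])
print(differing)
```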
The processor after the addition of this instruction would look the same:
(C) Discuss the effect of the modification in part (B) on the latency of
single-cycle non- pipelined CPU
No major components were added; only the values of the control signals changed,
so there would be no change in the latency.
(D) Discuss if the suggested modification in part (B) should be handled by
increasing/decreasing the number of pipelining stages that were dis-
cussed in class. Draw a pipelined version of the new processor similar
to Figure 4.49 of the book.
(E) Discuss if any new types of data hazards are introduced due to the new
instruction? If yes, can they be mitigated through forwarding? Use a
multi-cycle pipeline diagram like Figure 4.51 of the book to illustrate
the new forwarding paths.
(F) Discuss, based on the above analysis, why the new instruction was not
made part of the RISC-V architecture.
The new instruction does give a benefit when the index of an array is stored
in a variable (which is commonly the case in loops): it
saves the effort of adding the index to the base register before the 'ld'
instruction. Hence the instruction count goes down in this case, and performance
improves.
The argument against it, however, is that instead of an array-index-based
implementation, we can use a pointer-based
implementation of the loop, where the need to perform that addition goes away.
So, since the index-variable use case can be taken care of at the compiler level, the designers
probably deemed it unnecessary to add an instruction to their RISC ("reduced"
instruction set computer) architecture, whose design objective is to minimize the
number of instructions in the architecture.
Solution:
There is no existing instruction format for this instruction. A possible new in-
struction format can be:
(B) Modify the datapath and control signals of the single-cycle RISC-V
processor (Figure 4.17 of the book) to execute the new instruction
using the instruction format suggested in part (A). Use the minimal
amount of additional hardware and clock cycles/control states. Re-
member when adding new instructions, don’t break the operation of
the standard ones.
One additional adder is added, and we increase the number of register read ports from
2 to 3 in the register file. Since the output of the new adder has to be written to
the register file for this instruction, we have to increase the number of bits
of the MemtoReg control signal. MemtoReg now chooses between three options:
writing the output of the ALU to the register file, writing the output of the Read Data
port of data memory, and writing the output of the adder that we have added
specifically for this instruction. The basic working of the processor, however,
stays unchanged.
• Branch = 0
• MemRead = 0
• MemtoReg = 10
• ALUOP = 00
• MemWrite = 0
• ALUSrc = 0
• RegWrite = 1
(C) Discuss the effect of the modification in part (B) on the latency of
single-cycle non- pipelined CPU
The latency will not increase due to increase in the number of reading ports
of the register file. This is because reading is done in parallel.
However, there is an additional adder now. Since the adder is in parallel with
memory access the worst case latency is still the same.
Even though we are adding an adder, the addition can be done in parallel with
the MEM stage. Hence, there will be no new pipeline stage.
(E) Discuss if any new types of data hazards are introduced due to the new
instruction? If yes, can they be mitigated through forwarding? Use a
multi-cycle pipeline diagram like Figure 4.51 of the book to illustrate
the new forwarding paths.
Let us take this arbitrary set of instructions that highlights the new data hazard
that might occur:
ld x2, 5(x3)
add3 x1, x4, x6, x2
We can call this data hazard MEM/WB to ALU 3rd arg (or, more precisely,
1st arg of the new adder), where the third operand of add3 depends on the destination
register of a preceding ld instruction. This hazard can be mitigated using
a forwarding path from the MEM/WB pipeline register to the input of the
new adder, which is located in the MEM stage (shown below).
(F) Discuss, based on the above analysis, why the new instruction was not
made part of the RISC-V architecture.
Adding significant additional hardware, such as a 64-bit adder and a new forwarding
unit, does not provide much improvement in performance. The existing
pipeline and forwarding already allow two add instructions to be executed in
pretty much the same amount of time. So this new instruction, and the new hardware
added to support it, does not offer any significant improvement in performance.
Solution:
(B) Modify the datapath and control signals of the single-cycle RISC-V
processor (Figure 4.17 of the book) to execute the new instruction
using the instruction format suggested in part (A). Use the minimal
amount of additional hardware and clock cycles/control states. Re-
member when adding new instructions, don’t break the operation of
the standard ones.
• Branch = 0
• MemRead = 1
• MemtoReg = 10
• ALUOP = 00
• MemWrite = 0
• ALUSrc = 1
• RegWrite = 1
• Read2Src = 1
The processor would look something like this after this instruction is added:
(C) Discuss the effect of the modification in part (B) on the latency of
single-cycle non- pipelined CPU
Because of the addition of an extra Adder, the time needed for an instruction to
completely execute would now increase.
An extra stage can be added for the second ALU. The new MUX is added in the
IF stage.
(E) Discuss if any new types of data hazards are introduced due to the new
instruction? If yes, can they be mitigated through forwarding? Use a
multi-cycle pipeline diagram like Figure 4.51 of the book to illustrate
the new forwarding paths.
Four new kinds of data hazards are created due to the introduction of a new
pipeline register, and new adder which consumes register value. Four forwarding
paths will be needed from the new pipeline register to the four points in our
pipeline which consume register data.
(F) Discuss, based on the above analysis, why the new instruction was not
made part of the RISC-V architecture.
Solution:
There is no existing instruction format for this instruction. A possible new in-
struction format can be:
Note that we have swapped the standard places of rs1 and rs2 in this format.
This is because the given instruction uses rs1 for purposes that standard instruc-
tions use rs2 for. Same goes for rs2. So swapping places in the instruction format
will avoid the need to deal with this irregularity using hardware modifications.
(B) Modify the datapath and control signals of the single-cycle RISC-V
processor (Figure 4.17 of the book) to execute the new instruction
using the instruction format suggested in part (A). Use the minimal
amount of additional hardware and clock cycles/control states. Re-
member when adding new instructions, don’t break the operation of
the standard ones.
An equality checker is added, and we increase the number of register read ports from 2 to
3 in the register file. We have also added two additional muxes. One mux
chooses between normal branching and the branch-equal-to-memory instruction. The
second mux chooses between the output of ReadData3 and that of ShiftLeft1 to
be sent to the adder. An additional control signal, which I have called 'BeqSrc', is
added and is sent as the select line to the added multiplexers. Also, the Branch
control signal will now be asserted for both beq and beqm instructions.
• Branch = 1
• MemRead = 1
• MemtoReg = X
• ALUOP = 00
• MemWrite = 0
• ALUSrc = 1
• RegWrite = 0
• BeqSrc = 1
(C) Discuss the effect of the modification in part (B) on the latency of
single-cycle non- pipelined CPU
The comparator circuit is dependent on the output of the memory so this com-
parison is not happening in parallel with memory access. Hence we have a new
worst-case latency instruction (beqm) in the single-cycle processor. Cycle time
for all instructions will have to be slightly increased to deal with this.
(E) Discuss if any new types of data hazards are introduced due to the new
instruction? If yes, can they be mitigated through forwarding? Use a
multi-cycle pipeline diagram like Figure 4.51 of the book to illustrate
the new forwarding paths.
There can be two novel data hazards. Let us take an arbitrary set of instructions.
ld x2, 5(x20)
beqm x1, 50(x3), x2
This hazard can simply be mitigated by forwarding. In the above case, the third
operand x2 of beqm is dependent on the destination register of the preceding add
instruction. The data hazard can be called EX/MEM to adder and a forwarding
path between EX/MEM and adder can solve the hazard as demonstrated below:
(F) Discuss, based on the above analysis, why the new instruction was not
made part of the RISC-V architecture.
The additional comparator will increase the latency of memory access stage which
is already the worst stage hence the clock-cycle time for the pipeline goes up, thus
degrading performance. Since control decision can’t be moved to the ID stage,
the cost of control hazards is going up. Finally, the forwarding from MEM/WB
to ALU can already implement the beq followed by a load pretty fast in the
pipelined case. So the sacrifice in clock-cycle time by combining load and beq is
not really offering much benefit.
Solution:
(B) Modify the datapath and control signals of the single-cycle RISC-V
processor (Figure 4.17 of the book) to execute the new instruction
using the instruction format suggested in part (A). Use the minimal
amount of additional hardware and clock cycles/control states. Re-
member when adding new instructions, don’t break the operation of
the standard ones.
• MemWrite = 1
• ALUSrc = 1
• RegWrite = 1
• WriteRegSrc = 1
The processor would now look like this:
(C) Discuss the effect of the modification in part (B) on the latency of
single-cycle non- pipelined CPU
The addition of the MUX would increase the latency. Since the ALU is working
in parallel, it won’t affect the latency.
(D) Discuss if the suggested modification in part (B) should be handled by
increasing/decreasing the number of pipelining stages that were dis-
cussed in class. Draw a pipelined version of the new processor similar
to Figure 4.49 of the book.
The new ALU that has been added can be accommodated in the already existing EX
stage. This would not lead to any conflicts because the first part of the instruction
requires the previous value and the new value is not written until the writeback
stage. Thus, both ALUs are independent and can work simultaneously in
the execution stage, and hence no new stage is needed. The cycle time would not
change either because both ALUs take the same amount of time to compute
their results. In short, no major changes are to be made for a pipelined version of
(E) Discuss if any new types of data hazards are introduced due to the new
instruction? If yes, can they be mitigated through forwarding? Use a
multi-cycle pipeline diagram like Figure 4.51 of the book to illustrate
the new forwarding paths.
No new type of data hazard is created. The register value consumed by the
new adder can be supplied by the output of the existing mux that is used for
forwarding to ALU 1st arg.
(F) Discuss, based on the above analysis, why the new instruction was not
made part of the RISC-V architecture.
ld x31, 32(x10)
sd x11, 8(x10)
sub x12, x14, x13
add x15, x12, x13
beq x24, x0, label
add x31, x31, x14
Suppose we modify the pipeline so that it has only one memory (that handles both instructions
and data). In this case, there will be a structural hazard every time a program needs to fetch
an instruction during the same cycle in which another instruction accesses data.
(i) Draw a pipeline diagram to show where the code above will stall.
Solution:
. . . indicates a stall/bubble.
I have numbered the lines of the code and will use this numbering in the pipeline
execution diagram.
1 ld x31, 32(x10)
2 sd x11, 8(x10)
3 sub x12, x14, x13
4 add x15, x12, x13
5 beq x24, x0, label
6 add x31, x31, x14
1  IF  ID  EX  ME  WB
2      IF  ID  EX  ME  WB
3          IF  ID  EX  ME! WB
4              ... ... IF  ID  EX  ME! WB
5                          IF  ID  EX  ME! WB!
6                              IF  ID  EX  ME! WB
(ii) In general, is it possible to reduce the number of stalls/NOPs resulting from this structural
hazard by reordering code?
Solution:
In any case, there will be a structural hazard every time a program needs to fetch an
instruction during the same cycle in which another instruction accesses data. Reordering
the code will simply change the instructions in conflict. The number of stalls won’t
be reduced since every instruction has to be fetched. This means that every data
access results in a stall.
(iii) Must this structural hazard be handled in hardware? We have seen that data hazards can
be eliminated by adding NOPs to the code. Can you do the same with this structural
hazard? If so, explain how. If not, explain why not.
Solution:
Since NOP instructions also have to be fetched from instruction memory, they will
also cause structural hazards. Therefore, we cannot solve a structural hazard using
NOPs at the code level.
So it must be handled by the hardware: a hazard detection unit is needed
to insert stalls.
(iv) Approximately how many stalls would you expect this structural hazard to generate in a
typical program? Use the instruction mix from Question 2.
Solution:
Only ld and sd instructions access data memory. Hence, a stall is expected for approximately
20% + 25% = 45% of instructions.
(i) If there is no forwarding or hazard detection, insert NOPs to ensure correct execution.
Solution:
(ii) Now, change and/or rearrange the code to minimize the number of NOPs needed. You
can assume register x17 can be used to hold temporary values in your modified code.
Solution:
Rearranging the given code would not affect the number of NOPs.
Moreover, using an additional register might increase the number of NOPs, because we
would then need to wait for the value to be written to the new register in addition to
waiting for the value to be written to the original register before proceeding.
(iii) If the processor has forwarding, but we forgot to implement the hazard detection unit,
what happens when the original code executes?
Solution:
A hazard detection unit is needed only when a load-use data hazard occurs. All other
data hazards can be handled by the forwarding unit. Since the given code has no
load-use hazard, i.e., no instruction uses the destination register of the load
immediately preceding it, the absence of hazard detection does not disrupt the execution
of the given code.
(iv) If there is forwarding, for the first seven cycles during the execution of this code, specify
which signals are asserted in each cycle by hazard detection and forwarding units in Figure
4.59 of the book.
Solution:
cycle               1   2   3   4   5   6   7
and x16, x10, x9    IF  ID  EX  MEM WB
ld x29, 4(x16)          IF  ID  EX  MEM WB
ld x10, 0(x2)               IF  ID  EX  MEM WB
sub x29, x29, x15               IF  ID  EX  MEM
sd x29, 0(x16)                      IF  ID  EX
The hazard detection signals are the same in every cycle because there is no stalling, i.e., we
do not need to hold the PC or IF/ID register values in any cycle. The signals are as
follows:
PCWrite = 1
IFID Write = 1
MUX controlling the control signals = 0 (the original values pass)
The forwarding unit signals for each cycle are given below:
(v) If there is no forwarding, what new input and output signals do we need for the hazard
detection unit in Figure 4.59? Using this instruction sequence as an example, explain why
each signal is needed.
(vi) For the new hazard detection unit from (v), specify which output signals it asserts in each
of the first five cycles during the execution of this code.
Solution:
Cycle No.           1   2   3   4   5   6
and x16, x10, x9    IF  ID  EX  MEM WB
ld x29, 4(x16)          IF  ID  ... ... EX
ld x10, 0(x2)               IF  ... ... ID
sub x29, x29, x15                       IF
The signals are as follows:
Cycle No.  PCWrite  IFID Write  Control MUX
1          1        1           0 (original values)
2          1        1           0 (original values)
3          1        1           0 (original values)
4          0        0           1 (passes 0)
5          0        0           1 (passes 0)
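The table can be read as the output of a simple rule: in a stall cycle the unit freezes the PC and the IF/ID register and selects zeros for the control signals. A minimal sketch of that rule (the key `ControlMux` and the function name are illustrative labels, and the stall cycles are taken from the table rather than derived):

```python
def hazard_outputs(stall):
    """Outputs of the hazard detection unit for one cycle."""
    if stall:
        # Freeze PC and IF/ID, and zero the control signals to inject a bubble.
        return {"PCWrite": 0, "IFIDWrite": 0, "ControlMux": 1}
    return {"PCWrite": 1, "IFIDWrite": 1, "ControlMux": 0}

STALL_CYCLES = {4, 5}   # taken from the table above

for cycle in range(1, 6):
    print(cycle, hazard_outputs(cycle in STALL_CYCLES))
```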
Solution:
A pipeline without forwarding hardware requires NOP instructions to handle data
hazards correctly. The time taken would, therefore, be
(ii) Different programs will require different amounts of NOPs. How many NOPs (as a
percentage of code instructions) can remain in the typical program before that program
runs slower on the pipeline with forwarding?
Solution:
250 × ((x × n) + n) < 260n
250 × n(1 + x) < 260n
250 + 250x < 260
250x < 10
x < 0.04
∴ Up to 4% of n NOPs can remain. The program runs slower on the pipeline with
forwarding once its NOP instructions exceed 4% of n.
(iii) Repeat (ii); however, this time let x represent the number of NOP instructions relative to
n. (In (ii), x was equal to 0.3) Your answer will be with respect to x.
Solution:
250 × ((y × n) + n) < 200 × ((x × n) + n)
250 × n(1 + y) < 200 × n(1 + x)
250 + 250y < 200 + 200x
50 + 250y < 200x
1 + 5y < 4x
y < (4x − 1)/5
(iv) Can a program with only .075*n NOPs (in the no-forwarding case) possibly run faster on
the pipeline with forwarding? Explain why or why not.
Solution:
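200 × 1.075 × n < 250 × n
Therefore, a program with only .075*n NOPs (in the no-forwarding case) cannot
possibly run faster on the pipeline with forwarding: the no-forwarding pipeline needs
215n ps, while the pipeline with forwarding needs at least 250n ps even with zero NOPs.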
(v) At minimum, how many NOPs (as a percentage of code instructions) must a program have
before it can possibly run faster on the pipeline with forwarding?
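Solution: The bound on y from (iii) must be greater than zero for the program to be
able to run faster, since y cannot be negative.
0 < (4x − 1)/5
0.25 < x
Therefore, the program must have more than 0.25n NOPs (25% of code instructions) in
the no-forwarding case before it can run faster on the pipeline with forwarding.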
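The inequalities in (ii) through (v) all come from comparing (cycle time) × (1 + NOP fraction) for the two pipelines. A small Python sketch of that comparison follows, assuming the 200 ps (no forwarding) and 250 ps (forwarding) cycle times that appear in the inequalities above; the function name is made up for illustration, and exact fractions are used to avoid rounding noise.

```python
from fractions import Fraction

NO_FWD_CYCLE_PS = 200   # cycle time without forwarding, as used in 200 × n(1 + x)
FWD_CYCLE_PS = 250      # cycle time with forwarding, as used in 250 × n(1 + y)

def max_forwarding_nops(x):
    """Upper bound on y (NOP fraction with forwarding) for forwarding to win.

    Solves 250(1 + y) < 200(1 + x) for y; x is the NOP fraction without forwarding.
    """
    x = Fraction(x)
    return Fraction(NO_FWD_CYCLE_PS) * (1 + x) / FWD_CYCLE_PS - 1

print(max_forwarding_nops("0.3"))    # prints 1/25, i.e. the 4% from (ii)
print(max_forwarding_nops("0.075"))  # prints -7/50, negative, so forwarding cannot win (iv)
print(max_forwarding_nops("0.25"))   # prints 0, the break-even point from (v)
```

A negative bound means no achievable y satisfies the inequality, which is the situation in (iv).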
The type of RAW data dependence is identified by the stage that produces the result (EX
or MEM) and the next instruction that consumes the result (1st instruction that follows the one
that produces the result, 2nd instruction that follows, or both). We assume that the register
write is done in the first half of the clock cycle and that register reads are done in the second
half of the cycle, so “EX to 3rd” and “MEM to 3rd” dependences are not counted because they
cannot result in data hazards. We also assume that branches are resolved in the EX stage (as
opposed to the ID stage), and that the CPI of the processor is 1 if there are no data hazards.
Assume the following latencies for individual pipeline stages. For the EX stage, latencies are
given separately for a processor without forwarding and for a processor with different kinds of
forwarding.
(i) For each RAW dependency listed above, give a sequence of at least three assembly
statements that exhibits that dependency.
Solution:
EX to 1st only:
add x5, x6, x7
add x9, x5, x8
add x10, x11, x12
The first instruction writes some new value in x5. This register is also needed by
the second instruction. Because of pipelining, a data hazard would occur in reading
register x5.
EX to 2nd only:
add x5, x6, x7
add x9, x8, x10
add x11, x5, x12
The first instruction writes some new value in x5. This register is also needed by the
third instruction. Because of pipelining, a data hazard would occur while reading the
data from x5.
MEM to 2nd only:
ld x5, 40(x6)
add x9, x8, x7
add x10, x5, x12
The first instruction loads some data from the memory to x5. x5 is also needed by
the third instruction to perform addition. This creates a data hazard.
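The classification used in these examples can be expressed as a short Python sketch. Instructions are written by hand as hypothetical `(op, dest, sources)` tuples, and `classify_raw` is an illustrative helper, not part of the assignment. A load is treated as producing its result in MEM and everything else in EX.

```python
def classify_raw(program):
    """Label each producer with the RAW dependence type its next two instructions create."""
    labels = []
    for i, (op, dest, _) in enumerate(program):
        if dest is None:
            continue
        stage = "MEM" if op == "ld" else "EX"   # loads produce their value in MEM
        first = i + 1 < len(program) and dest in program[i + 1][2]
        # The 2nd follower depends on this producer only if the 1st did not overwrite dest.
        second = (i + 2 < len(program) and dest in program[i + 2][2]
                  and program[i + 1][1] != dest)
        if first and second:
            labels.append((i, f"{stage} to 1st and 2nd"))
        elif first:
            labels.append((i, f"{stage} to 1st only"))
        elif second:
            labels.append((i, f"{stage} to 2nd only"))
    return labels

ex_to_1st = [("add", "x5", ["x6", "x7"]), ("add", "x9", ["x5", "x8"]), ("add", "x10", ["x11", "x12"])]
ex_to_2nd = [("add", "x5", ["x6", "x7"]), ("add", "x9", ["x8", "x10"]), ("add", "x11", ["x5", "x12"])]
mem_to_2nd = [("ld", "x5", ["x6"]), ("add", "x9", ["x8", "x7"]), ("add", "x10", ["x5", "x12"])]

print(classify_raw(ex_to_1st))   # prints [(0, 'EX to 1st only')]
print(classify_raw(ex_to_2nd))   # prints [(0, 'EX to 2nd only')]
print(classify_raw(mem_to_2nd))  # prints [(0, 'MEM to 2nd only')]
```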
(ii) For each RAW dependency above, how many NOPs would need to be inserted to allow
your code from (i) to run correctly on a pipeline with no forwarding or hazard detection?
Show where the NOPs could be inserted.
Solution:
(iii) Analyzing each instruction independently will over-count the number of NOPs needed to
run a program on a pipeline with no forwarding or hazard detection. Write a sequence of
three assembly instructions so that, when you consider each instruction in the sequence
independently, the sum of the stalls is larger than the number of stalls the sequence actually
needs to avoid data hazards.
Solution:
ld x5, 0(x2)
add x9, x8, x10
add x11, x5, x9
In the above code, lines 1 and 3 have a MEM to 2nd RAW dependency, which needs
one stall (1 NOP), and lines 2 and 3 have an EX to 1st RAW dependency, which needs
two stalls (2 NOPs). So, a total of three stalls (3 NOPs) is needed if each dependency
is analyzed independently. However, when looking at them together, only two stalls
(2 NOPs) are needed, because the two NOPs that separate lines 2 and 3 also satisfy
the dependency between lines 1 and 3.
ld x5, 0(x2)
add x9, x8, x10
NOP
NOP
add x11, x5, x9
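The difference between the two counts can be reproduced with a small Python sketch. It assumes the rule stated in the problem (register write in the first half of a cycle, read in the second half), so without forwarding a consumer must issue at least three slots after its producer. Instructions are hand-written `(op, dest, sources)` tuples and both function names are hypothetical.

```python
MIN_GAP = 3  # consumer must issue at least 3 slots after its producer (no forwarding)

def nops_needed(program):
    """NOPs to insert before each instruction when the whole sequence is scheduled together."""
    last_write = {}   # register -> issue slot of its most recent writer
    slot = -1
    nops_before = []
    for _, dest, sources in program:
        earliest = slot + 1
        for reg in sources:
            if reg in last_write:
                earliest = max(earliest, last_write[reg] + MIN_GAP)
        nops_before.append(earliest - (slot + 1))
        slot = earliest
        if dest is not None:
            last_write[dest] = slot
    return nops_before

def independent_count(program):
    """Sum of stalls when every dependence is counted on its own (over-counts)."""
    total = 0
    for i, (_, _, sources) in enumerate(program):
        for distance in range(1, MIN_GAP):
            if i - distance >= 0 and program[i - distance][1] in sources:
                total += MIN_GAP - distance
    return total

sequence = [("ld", "x5", ["x2"]), ("add", "x9", ["x8", "x10"]), ("add", "x11", ["x5", "x9"])]
print(independent_count(sequence))  # prints 3
print(nops_needed(sequence))        # prints [0, 0, 2]
```

The second result places both NOPs directly before the last `add`, matching the schedule shown above.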
(iv) Assuming no other hazards, what is the CPI for the program described by the table above
when run on a pipeline with no forwarding? What percent of cycles are stalls? (For
simplicity, assume that all necessary cases are listed above and can be treated independently.)
Solution:
Number of stalls per instruction as per part (ii) = (0.10 * 2 + 0.25 * 2 + 0.10 * 1 +
0.15 * 1 + 0.15 * 2) = 1.25
CPI = 1 + 1.25 = 2.25
Percentage of the number of cycles that are stalls = 1.25/2.25 ≈ 56%.
(v) What is the CPI if we use full forwarding (forward all results that can be forwarded)?
What percent of cycles are stalls?
Solution:
All of the above dependencies can be resolved using forwarding except MEM to 1st
only, which constitutes 25% of instructions and still needs one stall cycle.
CPI = 1 + 0.25 = 1.25
Percentage of the number of cycles that are stalls = 0.25/1.25 = 20%.
(vi) Let us assume that we cannot afford to have three-input multiplexors that are needed for
full forwarding. We have to decide if it is better to forward only from the EX/MEM pipeline
register (next-cycle forwarding) or only from the MEM/WB pipeline register (two-cycle
forwarding). What is the CPI for each option?
Solution:
We have two options, either to forward from EX/MEM register or MEM/WB register.
For EX/MEM register, the number of stalls per RAW would be:
EX to 1st: 0
MEM to 1st: 2
EX to 2nd: 1
MEM to 2nd: 1
EX to 1st and 2nd: 1
This would give a stall of 0.9 per instruction and a CPI of 1.9.
For MEM/WB register, the number of stalls per RAW would be:
EX to 1st: 1
MEM to 1st: 1
EX to 2nd: 0
MEM to 2nd: 0
EX to 1st and 2nd: 1
This would give a stall of 0.5 per instruction and a CPI of 1.5.
Comparing the CPI of the two options (1.9 versus 1.5), forwarding from the MEM/WB
register is the better choice.
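The CPI figures in (iv) through (vi) all follow the same pattern: CPI = 1 + Σ (hazard frequency × stalls for that hazard). The Python sketch below recomputes them. The hazard frequencies are inferred from the terms of the stall sum in (iv), since the original table is not reproduced here, and the dictionary and function names are illustrative.

```python
HAZARDS = ["EX to 1st only", "MEM to 1st only", "EX to 2nd only",
           "MEM to 2nd only", "EX to 1st and 2nd"]

# Percent of instructions with each hazard (assumed, inferred from the sum in (iv)).
FREQ_PERCENT = dict(zip(HAZARDS, [10, 25, 10, 15, 15]))

# Stall cycles per hazard type for each forwarding scheme, as listed in the solutions.
STALLS = {
    "no forwarding":   dict(zip(HAZARDS, [2, 2, 1, 1, 2])),
    "full forwarding": dict(zip(HAZARDS, [0, 1, 0, 0, 0])),
    "EX/MEM only":     dict(zip(HAZARDS, [0, 2, 1, 1, 1])),
    "MEM/WB only":     dict(zip(HAZARDS, [1, 1, 0, 0, 1])),
}

def stalls_per_instruction(scheme):
    """Average stall cycles per instruction for the given scheme."""
    return sum(FREQ_PERCENT[h] * STALLS[scheme][h] for h in HAZARDS) / 100

def cpi(scheme):
    """Base CPI of 1 plus the average stall cycles."""
    return 1 + stalls_per_instruction(scheme)

def stall_fraction(scheme):
    """Fraction of all cycles that are stall cycles."""
    return stalls_per_instruction(scheme) / cpi(scheme)

for scheme in STALLS:
    print(f"{scheme}: CPI = {cpi(scheme):.2f}, stalls = {stall_fraction(scheme):.0%} of cycles")
```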
(vii) For the given hazard probabilities and pipeline stage latencies, what is the speedup achieved
by each type of forwarding (EX/MEM, MEM/WB, or full) as compared to a pipeline that
has no forwarding?
Solution:
(viii) What would the additional speedup (relative to the fastest processor from vii) be if
we added “time-travel” forwarding that eliminates all data hazards? Assume that the
yet-to-be-invented time-travel circuitry adds 100ps to the latency of the full-forwarding
EX stage.
Solution:
If we add the time-travel circuitry, the latency would increase by 100 ps to 820 ps,
but the CPI would drop to 1 because no NOPs are needed. This would give a speedup
of (720*1.25)/(1*820) ≈ 1.10.
Hence, the additional speedup is small: the time-travel processor is only about 10%
faster than the full-forwarding one.
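The ratio above is the execution time per instruction (CPI times cycle time) of the full-forwarding design divided by that of the time-travel design. A minimal Python sketch using the solution's numbers (720 ps and CPI 1.25 for full forwarding, 820 ps and CPI 1 for time travel); the function name is made up for illustration:

```python
def time_per_instruction_ps(cpi, cycle_ps):
    """Average time per instruction: cycles per instruction times cycle time."""
    return cpi * cycle_ps

full_forwarding = time_per_instruction_ps(cpi=1.25, cycle_ps=720)   # 900 ps
time_travel = time_per_instruction_ps(cpi=1.0, cycle_ps=820)        # 820 ps

speedup = full_forwarding / time_travel
print(round(speedup, 3))  # prints 1.098
```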
(ix) The table of hazard types has separate entries for “EX to 1st” and “EX to 1st and EX to
2nd”. Why is there no entry for “MEM to 1st and MEM to 2nd”?
Solution:
The number of stalls for EX to 1st and for EX to 1st and 2nd can differ (0 and 1
respectively when forwarding only from EX/MEM, as in (vi)), whereas the number of
stalls for MEM to 1st is always the same as for MEM to 1st and 2nd, so no separate
hazard type is listed.