Solutions: CS152 Computer Architecture and Engineering
SOLUTIONS
ISAs, Microprogramming and Pipelining
Assigned 8/26/2016 Problem Set #1 Due September 13
The problem sets are intended to help you learn the material, and we encourage you to
collaborate with other students and to ask questions in discussion sections and office hours to
understand the problems. However, each student must turn in his own solution to the problems.
The problem sets also provide essential background material for the quizzes. The problem sets
will be graded primarily on an effort basis, but if you do not work through the problem sets you
are unlikely to succeed at the quizzes! We will distribute solutions to the problem sets on the day
the problem sets are due to give you feedback. Homework assignments are due at the beginning
of class on the due date. Late homework will not be accepted, except for extreme circumstances
and with prior arrangement.
Problem 1: CISC, RISC, Accumulator, and Stack: Comparing ISAs
In this problem, your task is to compare four different ISAs. x86 is an extended accumulator,
CISC architecture with variable-length instructions. RISC-V is a load-store, RISC architecture
with fixed-length instructions (for this problem only consider the 32-bit form of its ISA). We will
also look at a simple stack-based ISA and at an accumulator architecture.
For the above x86 assembly code, how many bytes of instructions need to be fetched if b = 10?
4+10*(13)+10=144
Assuming 32-bit data values, how many bytes of data memory need to be fetched? Stored?
Fetched: the compare instruction accesses memory and brings in a 4-byte word b+1 times:
4*11 = 44 bytes
Stored: 0
Many translations are appropriate; here’s one. Other people have used sub instead of slt.
Remember that (as far as we are concerned for this PS, Lab 1, or any Quiz) RISC-V instructions
are only 32 bits long, so you need to construct a 32-bit address from 12-bit and 20-bit
immediates. Also, since the problem specified that the value of b was already contained in x1,
you could skip the lui/lw instructions entirely.
x86 instruction          RISC-V translation
xor %ecx,%ecx            xor x3, x3, x3
cmp 0x8049580,%ecx       lui x6, 0x08049
                         lw x1, 0x580(x6)
                         slt x5, x3, x1
jl L1                    bne x5, x0, L1
jmp done                 j done
add %eax,%edx
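The lui/lw pair in the translation relies on splitting a 32-bit address into a 20-bit upper part and a 12-bit lower part. The following Python sketch shows one way to do that split; the helper names `split_address` and `rebuild` are hypothetical. It also handles the case (not needed for 0x08049580) where the low 12 bits are 0x800 or larger, since the 12-bit immediate is sign-extended by the hardware.

```python
def split_address(addr):
    """Split a 32-bit address into (hi20, lo12) for a lui + lw pair."""
    lo12 = addr & 0xFFF
    # The 12-bit immediate is sign-extended, so values >= 0x800 act as negative.
    if lo12 >= 0x800:
        lo12 -= 0x1000
    # Compensate in the upper 20 bits so that (hi20 << 12) + lo12 == addr.
    hi20 = ((addr - lo12) >> 12) & 0xFFFFF
    return hi20, lo12


def rebuild(hi20, lo12):
    """Recombine the two immediates the way lui followed by lw would."""
    return ((hi20 << 12) + lo12) & 0xFFFFFFFF


hi, lo = split_address(0x08049580)
print(hex(hi), hex(lo))  # 0x8049 0x580
```

For the address of b used here, no correction is needed, which is why the translation can use 0x08049 and 0x580 directly.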
How many bytes is the RISC-V program using your direct translation?
10*4 = 40 (or 8*4=32 if you leave out the lui/lw)
How many bytes of RISC-V instructions need to be fetched for b = 10 using your direct
translation?
If you get part of the address of b into a register, you don’t have to repeat the lui. So there are 3
instructions in the prelude and 6 that are part of the loop (we don’t need to fetch the “j” until the
11th iteration). There are 5 instructions in the 11th iteration. All instructions are 4 bytes.
4*(3+10*6+5) = 272. If you kept b in a register, it’s smaller.
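The count can be checked with a short Python sketch. The function name and the generalization to other values of b are assumptions for illustration; the per-phase instruction counts (3, 6, 5) are the ones stated in the answer.

```python
INSTR_BYTES = 4  # every RV32 instruction considered here is 4 bytes


def riscv_bytes_fetched(b, prelude=3, loop_body=6, final_pass=5):
    """Instruction bytes fetched: prelude, b full iterations, one exit pass."""
    instructions = prelude + b * loop_body + final_pass
    return INSTR_BYTES * instructions


print(riscv_bytes_fetched(10))  # 272
```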
Assuming 32-bit data values, how many bytes of data memory need to be fetched? Stored?
Fetched: 11 * 4 = 44 bytes (or zero if you keep b in a register)
Stored: 0
push i
inc ; i=i+1
pop i
goto loop
done:
Using your stack translations from part (c), how many bytes of stack instructions need to be
fetched for b = 10?
(5*3+2*1) + 10*(9*3+3*1)+(4*3+1) = 330
Assuming 32-bit data values, how many bytes of data memory need to be fetched?
fetched = 4*number of dynamic pushes. There are 2 in the prelude, 2 at loop that get executed 11
times, and 3 at L1 that get executed 10 times. 2+2*11+3*10=54. 54*4 bytes = 216 bytes
Stored?
stored = 4 * number of dynamic pops. 4*(3+2*10) = 92 bytes
Note that the stack-depth in this program never exceeds two words, so we don’t have to worry
about extra accesses for spilling.
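The three stack-ISA totals can be reproduced with the sketch below. It assumes, as the expression for instruction bytes suggests, that instructions are either 3 bytes or 1 byte long; the function names and the generalization to other values of b are hypothetical.

```python
def stack_instr_bytes(b):
    """Instruction bytes fetched, using the 3-byte / 1-byte counts above."""
    prelude = 5 * 3 + 2 * 1    # executed once
    per_iter = 9 * 3 + 3 * 1   # executed b times
    exit_pass = 4 * 3 + 1      # final, partial pass through the loop
    return prelude + b * per_iter + exit_pass


def stack_data_bytes(b, word=4):
    """Return (bytes fetched, bytes stored) for 32-bit data values."""
    pushes = 2 + 2 * (b + 1) + 3 * b  # prelude, loop test, L1 body
    pops = 3 + 2 * b
    return word * pushes, word * pops


print(stack_instr_bytes(10))  # 330
print(stack_data_bytes(10))   # (216, 92)
```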
If you could push and pop to/from a four-entry register file rather than memory (the Java virtual
machine does this), what would be the resulting number of bytes fetched and stored?
There are only four variables, so almost all memory accesses could be eliminated. If you stick to
a direct translation where you keep b in memory, then you would have to get it 11 times: 44
bytes fetched, 0 bytes stored. If you keep b in a register, too, then you only have to get it once: 4
bytes fetched, 0 bytes stored (but the code in 1.C’s answer doesn’t directly support this).
The above is just one way of doing it, using both the accumulator and the subtractor. There is
also a way to solve this problem using just one of the two, by storing and loading values from
memory at every loop iteration, similar to the stack architecture. Both solutions are fine.
How many bytes is your program?
17
Can the same program be implemented with just one accumulator (i.e., no subtractor)?
There are two ways to answer this. If we don’t load and store values to and from memory at
every iteration the answer is no. By the nature of this ISA, the moment we load one variable, say
B, we cannot store another variable which is necessary to do the increments.
However, similar to the stack architecture, we can make this work with one accumulator. It just
takes more loads and stores.
If not, how would you extend this ISA to implement this program with just one accumulator?
If the answer to the above was no, we can simply add a multiply instruction, or add memory-
memory instructions (instructions that can operate on two memory addresses). However, the
latter would change the nature of the ISA.
CISC < RISC < STACK for both static and dynamic code size.
(RISC ≈ CISC) < STACK for data memory traffic
Most optimizations revolve around the elimination of unnecessary control flow. Also, the load
can be hoisted out of the loop.
This re-write brings dynamic code size down to 136 bytes; static code size to 28; and memory
traffic down to 4 bytes.
Problem 2: Microprogramming and Bus-Based Architectures
Note that to maintain “clean” behavior of your microcode, no registers in the register file should
change their value during execution (unless they are written to). This does not refer to
the registers in the datapath (IR, A, B, MA). Thus, using asterisks for the load signals (ldIR,
ldA, ldB, and ldMA) is acceptable as long as the correctness of your microcode is not affected
(and in fact, should be done for full optimality). Also note the ubr to FETCH0 must be contained
on its own line, since you can’t “spin” on the same micro-code line if memory is still busy OR
jump to FETCH0 if memory is not busy. S is either “spin on same micro-code line (upc)” or go
to upc+1.
When performing a memory access, you could be “spinning” for many cycles, waiting for
memory to become “not busy”. In that time, you must keep all inputs to the memory system
constant: thus, ldMA must be 0, because you don’t want the memory address to change while
accessing memory! Likewise, in “Mem <- A+B”, ldA and ldB must also be set to 0, so that the
data being sent to memory stays constant. To phrase this another way, we have no idea when
the memory system latches in our inputs.
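The spin behavior can be illustrated with a small Python model. This is only a sketch: `next_upc` and `run_line` are hypothetical names, and the model tracks nothing but the micro-PC and the MA register.

```python
def next_upc(upc, busy):
    """'S' sequencing: spin on the same line while busy, else go to upc+1."""
    return upc if busy else upc + 1


def run_line(upc, busy_trace, ma):
    """Hold one microcode line through a sequence of busy-signal samples.

    Returns (next upc, cycles spent, MA value). MA is never reloaded here,
    which models ldMA = 0 for the whole memory access.
    """
    cycles = 0
    for busy in busy_trace:
        cycles += 1
        following = next_upc(upc, busy)
        if following != upc:
            return following, cycles, ma
    return upc, cycles, ma


# Memory is busy for two cycles, then responds on the third.
print(run_line(5, [True, True, False], 0x100))  # (6, 3, 256)
```

Because the number of spin cycles is not known in advance, every input to memory has to stay unchanged on each of them, which is why the load signals are 0 rather than don't-care on such lines.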
Finally, note the cleverness of ldA being “0” on FETCH2. On entering an instruction, A always
equals PC+4. This saves a cycle if we dispatch to a jump instruction, which first loads PC+4
into the ALU (or the RDNPC instruction, which loads PC+4 into rd).
Problem 2.B Implementing MOVN Instruction
A few notes:
- ldIR is zero for all uops because we keep needing to read the actual values of Rs and Rd, which
  are stored in the IR register.
- ldMA is kept at 0 when performing a memory operation because memory operations are
  multi-cycle, and thus you need to hold the memory address constant (this logic also applies to
  ldA and ldB when they are used as sources for memory).
Problem 2.C Instruction Execution Times
How many cycles does it take to execute the following instructions in the microcoded RISC-V
machine? Use the states and control points from RISC-V-Controller-2 in Lecture 2 (or Lab 1, in
${LAB1ROOT}/src/rv32_ucode/micrcode.scala) and assume Memory will not assert its busy
signal.
Instruction                       Cycles
ADD x3,x2,x1                      3 + 3 = 6
ADDI x2,x1,#4                     3 + 3 = 6
SW x1,0(x2)                       3 + 5 = 8
BNE x1,x2,label #(x1 == x2)       3 + 4 = 7
BNE x1,x2,label #(x1 != x2)       3 + 3 + 4 = 10
BEQ x1,x2,label #(x1 == x2)       3 + 3 + 4 = 10
BEQ x1,x2,label #(x1 != x2)       3 + 4 = 7
J label                           3 + 4 = 7
JAL label                         3 + 5 = 8
JALR x1                           3 + 5 = 8
As discussed in Lecture 2, instruction execution includes the number of cycles needed to fetch
the instruction. The lecture notes used 4 cycles for the fetch phase, while Worksheet 1 shows
that this phase can actually be implemented in 3 cycles; either answer was fine. The above
table uses 3 cycles for the fetch phase.
The above answers are derived from the micro-coded processor provided in Lab 1. It is okay if
your answers differ because they were derived from the lecture notes.
Overall, BNE (for a taken branch), and BEQ (for a taken branch) take the most cycles to execute
(10), while arithmetic functions such as ADD and ADDI take the fewest cycles (6).
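As a cross-check, the totals can be recomputed from a fetch cost plus a per-instruction execute cost. The dictionary keys and the `total_cycles` helper are hypothetical; the execute-phase counts are the ones used in this solution (a taken branch costs 3 + 4 cycles after fetch).

```python
# Execute-phase cycles (fetch excluded), taken from the table in this solution.
EXECUTE_CYCLES = {
    "ADD": 3,
    "ADDI": 3,
    "SW": 5,
    "BNE not taken": 4,
    "BNE taken": 3 + 4,
    "BEQ taken": 3 + 4,
    "BEQ not taken": 4,
    "J": 4,
    "JAL": 5,
    "JALR": 5,
}


def total_cycles(fetch_cycles=3):
    """Total cycles per instruction for a given fetch-phase length."""
    return {name: fetch_cycles + ex for name, ex in EXECUTE_CYCLES.items()}


totals = total_cycles()
print(max(totals.values()), min(totals.values()))  # 10 6
```

Calling `total_cycles(4)` gives the numbers for the 4-cycle fetch phase from the lecture notes; every entry simply grows by one.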
Problem 3: 6-Stage Pipeline
The second write port improves performance by resolving some RAW hazards earlier than they
would be if ALU operations had to wait until writeback to provide their results to subsequent
dependent instructions. It would help with the following instruction sequence:
The important insight is that the second write port cannot resolve data hazards for immediately
back-to-back instructions. (Recall that the RF is read in the ID stage. By the time the first
instruction has written back, it is in M1, so it is the third instruction that is in ID.)
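The stage occupancy behind this argument can be tabulated with a short Python sketch. It assumes one instruction issues per cycle with no stalls; `stage_of` is a hypothetical helper, and the stage names follow the ones used in this problem.

```python
STAGES = ["IF", "ID", "EX", "M1", "M2", "WB"]


def stage_of(instr_index, cycle):
    """Stage occupied by instruction instr_index (0-based) at a given cycle.

    Instruction i enters IF at cycle i; returns None outside the pipeline.
    """
    position = cycle - instr_index
    if 0 <= position < len(STAGES):
        return STAGES[position]
    return None


# Instruction 0 writes the RF at the end of EX (cycle 2).
# One cycle later the new value is visible to whoever is in ID.
for i in range(3):
    print(i, stage_of(i, 3))
```

At cycle 2 the second instruction is already in ID, so it reads the register file before the early write lands and still needs a bypass.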
The bypass path from the end of M1 to the end of ID can be removed. (Credit was also given for
the bypass path from the beginning of M2 to the beginning of EX, since these are equivalent.)
Additionally, ALU results no longer have to be bypassed from the end of M2 or the end of WB,
but these bypass paths are still used to forward load results to earlier stages.
Illegal address exceptions are not detected until the start of the M2 stage. Since writebacks can
occur at the end of the EX stage, it is possible for an ALU op following a memory access to an
illegal address to have written its value back before the exception is detected, resulting in an
imprecise exception. For example:
Stall any ALU op in the ID stage if the instruction in the EX stage is a load or a store. The
instruction sequence above engages this interlock.
Loads and stores account for about one-third of dynamic instructions. Assuming that the
instruction following a load or store is an ALU op two-thirds of the time, and ignoring
the existing load-use delay, this solution will increase the CPI by (1/3)*(2/3) = 2/9. However,
only a qualitative explanation was necessary for credit.
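The estimate is a product of two fractions and a one-cycle stall, which the following Python sketch computes exactly. The function and parameter names are hypothetical, and the one-third and two-thirds figures are the assumptions stated above.

```python
from fractions import Fraction


def cpi_increase(mem_fraction, alu_follows_fraction, stall_cycles=1):
    """Extra CPI from stalling an ALU op that directly follows a load/store."""
    return mem_fraction * alu_follows_fraction * stall_cycles


print(cpi_increase(Fraction(1, 3), Fraction(2, 3)))  # 2/9
```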
In addition to reading an instruction’s source operands in the ID stage, also read the destination
register, rd. If an early writeback occurs before a preceding exception was detected, then the old
value of rd is preserved in the EX/M1 pipeline register and can be restored to the register file,
maintaining precise state.
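The save-and-restore idea can be sketched in Python as follows. The `RegFile` class and its method names are hypothetical, and the model ignores pipeline timing (and the special handling of x0): it only shows that keeping the old rd value is enough to undo an early writeback.

```python
class RegFile:
    """Toy register file supporting an undoable early writeback."""

    def __init__(self, num_regs=32):
        self.regs = [0] * num_regs

    def early_writeback(self, rd, value):
        """Write rd at the end of EX and return the old value.

        The returned value stands in for the copy kept in the EX/M1
        pipeline register.
        """
        old_value = self.regs[rd]
        self.regs[rd] = value
        return old_value

    def restore(self, rd, old_value):
        """Undo the early writeback after an older instruction faults."""
        self.regs[rd] = old_value


rf = RegFile()
rf.regs[3] = 7
saved = rf.early_writeback(3, 42)  # ALU op writes x3 early
rf.restore(3, saved)               # preceding memory access raised an exception
print(rf.regs[3])  # 7
```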
The point in time that a branch comparison occurs is circled above. The second circle (OR) is
when the decode stage recognizes a dependent load in the EXE stage (at t5) and stalls.
The IF_KILL and DEC_KILL signals go out in t5, when the “mispredict” is discovered.
Bubbles are inserted into the pipeline, and show up in t6.
Loop is not taken. The BHT is “strongly” taken, so BHT predicts “taken” when we see BEQ.
BHT is in Decode, and Fetch stage always predicts PC+4, so we eat 1 cycle when the BHT
predicts taken branch, and we eat another cycle if BHT predicts taken, but branch is actually not
taken (i.e., it just degenerates to the original 2-cycle branch penalty). At t4 (as circled), the
Decode stage kills the fetch stage to redirect it down the “taken” path. However, at t5 we resolve
the branch comparison in Execute and must correct for the BHT’s misprediction, and kill Fetch
and Decode.
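The penalty cases described here can be summarized in a small Python function. This is a sketch: `branch_penalty` is a hypothetical name, and the case where the BHT predicts not-taken but the branch is taken is assumed to cost the original 2-cycle penalty, since it is also resolved in Execute.

```python
def branch_penalty(bht_predicts_taken, actually_taken):
    """Bubble cycles for a branch with a BHT in Decode.

    Fetch always predicts PC+4, and branches resolve in Execute.
    """
    if bht_predicts_taken and actually_taken:
        return 1  # Decode kills Fetch to redirect down the taken path
    if not bht_predicts_taken and not actually_taken:
        return 0  # PC+4 was already the right path
    return 2      # mispredict caught in Execute: kill Fetch and Decode


# The loop-exit case discussed above: BHT says taken, branch is not taken.
print(branch_penalty(True, False))  # 2
```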
Problem 4.D Adding a BTB
BTB mispredicts the exit, and it takes two cycles for branch logic in Exe to catch the mistake.
The first circle is drawn to show when the BTB had a hit and predicted “taken”. The second
circle in t3 shows when the branch comparison catches a mispredict and kills two cycles.
Problem 5: CISC vs RISC
For each of the following questions, circle either CISC or RISC, depending on which ISA you
feel would be best suited for the situation described. Also, briefly explain your reasoning.
CISC
CISC ISAs provided more complex, higher-level instructions such as string manipulation
instructions and special addressing modes convenient for indexing tables (say for your
company’s payroll application). Two example CISC instructions: “DBcc: Test Condition,
Decrement, and Branch” and “CMP2: Compare Register against Upper and Lower Bounds”.
This made life easy if you stared at assembly all day, and couldn’t hide behind
convenient software abstractions/subroutines!
Compilers had difficulty targeting CISC ISAs in part because the complicated instructions have
many hard-to-analyze side-effects. A load-store/register-register RISC ISA
which limits side-effects to a single register or memory location per instruction is relatively easy
for a compiler to understand, analyze, and schedule for.
RISC
CISC
When instruction fetch takes 10x longer than a CPU logic operation, you are going to want to
push as much compute as you can
into each instruction! For example, a CISC instruction which performs expensive, multi-cycle
floating point routines in hardware is FAR faster than a software floating point subroutine that
requires perhaps dozens of expensive instruction fetches.
Because RISC instructions tend to have simple, easy to analyze side-effects, they lend
themselves more readily to pipelined micro-architectures which dynamically check for
dependencies between instructions and interlock or bypass when dependencies arise. And
because little work needs to be performed in each stage, the pipeline can be clocked at very high
frequencies.
This advantage is evident in modern micro-architectures of old CISC ISAs: typically the front-
end of the processor has a decoder which translates CISC instructions (e.g., x86 instructions)
into RISC “micro-ops”, which a high-performance pipeline can then dynamically schedule for
maximum performance.
CISC architectures such as x86 and IBM S/360 are still around for legacy reasons.
But if you had a chance at a clean slate, you’d probably prefer a clean RISC implementation
with a direct translation to the micro-architecture instead of using area and power on a CISC
decoder front-end (not to mention the additional complexity forced on your memory system to
handle the odd CISC addressing modes).
RISC
Problem 6: Iron Law of Processor Performance
d) Improving memory access speed

   Instructions / Program:
      No effect: since instructions make no assumption about memory speed.
   Cycles / Instruction:
      Decrease: if access to memory is pipelined (>1 cycle), since it will now take less cycles.
      No effect: if memory access is done in a single cycle.
   Seconds / Cycle:
      Decrease: if memory access is on the critical path or memory was 1 cycle.
      No effect: if memory is pipelined and just takes less cycles.
   Overall performance:
      Improve: improving memory access time, at least by these Iron Law metrics, will increase
      performance of the whole system (unless you chose “no effect” for everything). Of course,
      there could be other secondary costs of improving mem. access speeds, like having to use
      smaller caches, but I’m getting carried away here.

e) Adding 16-bit versions of the most common instructions in MIPS (normally 32-bits in length)
   to the ISA (i.e., make MIPS a variable length ISA)

   Instructions / Program:
      No effect: because you are replacing 32b instructions with equivalent 16b versions, it
      saves on code space, but it leaves the Inst/Program count unchanged.
   Cycles / Instruction:
      No effect: you are simply executing equivalent 16b versions of regular 32b instructions.
      Both appear identical to the pipeline.
      Decrease: since code size has shrunk, I$ hits will increase and thus less cycles will be
      spent fetching instructions.
   Seconds / Cycle:
      Increase: decode may increase this since the instruction format is more complex (and you
      have to deal with figuring out where the instruction boundaries are).
      No effect: if this fits within the cycle time, since this makes no change to the pipeline
      and only increases the decode stage (or perhaps adds another stage to the front-end).
   Overall performance:
      Ambiguous: the main advantage is smaller code size, which can improve I$ hit rates and
      save on fetch energy (get more instructions per fetch). This can improve performance (or
      at least energy), however the more complex decode could also counteract these gains.

f) For a given CISC ISA, changing the implementation of the micro-architecture from a
   microcoded engine to a RISC pipeline (with a CISC-to-RISC decoder on the front-end)

   Instructions / Program:
      No effect: because the ISA is not changing, the binary does not change, and thus there is
      no change to Inst/Program.
   Cycles / Instruction:
      Decrease: microcoded machines take several clock cycles to execute an instruction, while
      the RISC pipeline should have a CPI near 1 (thanks to pipelining).
   Seconds / Cycle:
      No effect: the amount of work done in one pipeline stage and one microcode cycle are
      about the same.
      Increase: the RISC pipeline introduces longer control paths and adds bypasses, which are
      likely to be on the critical path.
   Overall performance:
      Increase: it should be far easier to pipeline RISC uops once the CISC instructions have
      been decoded/translated, leading to a higher performance machine (see modern x86
      machines).
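The Iron Law multiplies the three factors discussed in this problem: time per program = (instructions / program) x (cycles / instruction) x (time / cycle). The Python sketch below applies it to made-up numbers, chosen purely for illustration, in the spirit of part f): the instruction count and cycle time stay fixed while CPI drops.

```python
def exec_time_ns(instructions, cpi, cycle_time_ns):
    """Iron Law: execution time = instructions * CPI * cycle time."""
    return instructions * cpi * cycle_time_ns


# Hypothetical numbers, not taken from the problem set.
microcoded = exec_time_ns(1_000_000, 6, 2)  # several cycles per instruction
pipelined = exec_time_ns(1_000_000, 2, 2)   # lower CPI, same binary and clock

print(microcoded / pipelined)  # 3.0
```

If the pipeline also lengthened the cycle time, as the "Increase" case for Seconds / Cycle allows, the third argument would grow and the speedup would shrink accordingly.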