Solutions: CS152 Computer Architecture and Engineering

CS152 Computer Architecture and Engineering
SOLUTIONS
ISAs, Microprogramming and Pipelining
Assigned 8/26/2016 Problem Set #1 Due September 13
The problem sets are intended to help you learn the material, and we encourage you to
collaborate with other students and to ask questions in discussion sections and office hours to
understand the problems. However, each student must turn in his own solution to the problems.
The problem sets also provide essential background material for the quizzes. The problem sets
will be graded primarily on an effort basis, but if you do not work through the problem sets you
are unlikely to succeed at the quizzes! We will distribute solutions to the problem sets on the day
the problem sets are due to give you feedback. Homework assignments are due at the beginning
of class on the due date. Late homework will not be accepted, except for extreme circumstances
and with prior arrangement.
Problem 1: CISC, RISC, Accumulator, and Stack: Comparing ISAs
In this problem, your task is to compare four different ISAs. x86 is an extended accumulator,
CISC architecture with variable-length instructions. RISC-V is a load-store, RISC architecture
with fixed-length instructions (for this problem only consider the 32-bit form of its ISA). We will
also look at a simple stack-based ISA and at an accumulator architecture.
Problem 1.A CISC
How many bytes is the program?

19
For the above x86 assembly code, how many bytes of instructions need to be fetched if b = 10?
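4 + 10*13 + 10 = 144 (4 bytes of prelude, 13 bytes for each of the 10 loop iterations, and 10
bytes for the final compare and exit)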
Assuming 32-bit data values, how many bytes of data memory need to be fetched? Stored?
Fetched: the compare instruction accesses memory, and brings in a 4-byte word b+1 times:
4*11 = 44
Stored: 0
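The counts above can be reproduced with a short tally. This is a minimal sketch: the per-instruction sizes are assumed 32-bit x86 encodings, chosen to be consistent with the 19-byte total, and the list and function names are illustrative only.

```python
# Assumed encoding sizes (bytes) for each x86 instruction in the program.
# These are assumptions that agree with the 19-byte static total above.
PRELUDE = [("xor %edx,%edx", 2), ("xor %ecx,%ecx", 2)]
LOOP_TEST = [("cmp 0x8049580,%ecx", 6), ("jl L1", 2)]
LOOP_EXIT = [("jmp done", 2)]
LOOP_BODY = [("add %eax,%edx", 2), ("inc %ecx", 1), ("jmp loop", 2)]


def total(instrs):
    """Sum the byte sizes of a list of (text, size) pairs."""
    return sum(size for _, size in instrs)


def static_bytes():
    return total(PRELUDE + LOOP_TEST + LOOP_EXIT + LOOP_BODY)


def dynamic_bytes(b):
    # b passes run the test and the body; the final pass runs the test
    # and then the exit jump.
    taken = total(LOOP_TEST) + total(LOOP_BODY)
    final = total(LOOP_TEST) + total(LOOP_EXIT)
    return total(PRELUDE) + b * taken + final


def data_bytes_fetched(b, word=4):
    # The cmp reads b from memory once per test, i.e. b + 1 times.
    return word * (b + 1)


print(static_bytes(), dynamic_bytes(10), data_bytes_fetched(10))  # 19 144 44
```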
Problem 1.B RISC
Many translations will be appropriate; here’s one. Other people have used sub instead of slt.
Remember that (as far as we are concerned for this PS, Lab 1, or any Quiz) RISC-V instructions
are only 32 bits long, so you need to construct a 32-bit address from 12-bit and 20-bit
immediates. Also, since the problem specified that the value of b was already contained in x1,
you could skip the lui/lw instructions entirely.
x86 instruction        label   RISC-V instruction sequence
xor %edx,%edx                  xor  x4, x4, x4
xor %ecx,%ecx                  xor  x3, x3, x3
cmp 0x8049580,%ecx     loop:   lui  x6, 0x08049
                               lw   x1, 0x580(x6)
                               slt  x5, x3, x1
jl L1                          bne  x5, x0, L1
jmp done                       j    done
add %eax,%edx          L1:     add  x4, x4, x2
inc %ecx                       addi x3, x3, #1
jmp loop                       j    loop
...                    done:   ...
How many bytes is the RISC-V program using your direct translation?
10*4 = 40 (or 8*4=32 if you leave out the lui/lw)
How many bytes of RISC-V instructions need to be fetched for b = 10 using your direct
translation?
If you get part of the address of b into a register, you don’t have to repeat the lui. So there are 3
instructions in the prelude and 6 that are part of the loop (we don’t need to fetch the “j” until the
11th iteration). There are 5 instructions in the 11th iteration. All instructions are 4 bytes.
4*(3+10*6+5) = 272. If you kept b in a register, it’s smaller.
Assuming 32-bit data values, how many bytes of data memory need to be fetched? Stored?
Fetched: 11 * 4 = 44 (or zero if you keep b in a register)
Stored: 0
Problem 1.C Stack
       pop a         ;m[a] <- a
       push 0        ;push a dummy value (mem[0]) onto stack so we
       zero          ; have something to zero
       pop result    ;m[result] <- 0 (result)
       push 0        ;push a dummy value onto stack
       zero
       pop i         ;m[i] <- 0 (i)
loop:  push 0x8000   ;push b
       push i
       sub           ;b-i
       bnez L1
       goto done
L1:    push a
       push result
       add
       pop result    ;result = result+a
       push i
       inc           ;i=i+1
       pop i
       goto loop
done:
How many bytes is your program?

50
Using your stack translations from part (c), how many bytes of stack instructions need to be
fetched for b = 10?
(5*3+2*1) + 10*(9*3+3*1)+(4*3+1) = 330
Assuming 32-bit data values, how many bytes of data memory need to be fetched?
fetched = 4*number of dynamic pushes. There are 2 in the prelude, 2 at loop that get executed 11
times, and 3 at L1 that get executed 10 times. 2+2*11+3*10=54. 54*4 bytes = 216 bytes
Stored?
stored = 4 * number of dynamic pops. 4*(3+2*10) = 92 bytes
Note that the stack-depth in this program never exceeds two words, so we don’t have to worry
about extra accesses for spilling.
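The static size, dynamic instruction bytes, data traffic, and stack depth quoted above can be checked by running the program on a tiny interpreter. This is only a sketch under stated assumptions: instructions with a memory or label operand are taken to be 3 bytes and operand-free ones 1 byte, `bnez` is assumed to pop the value it tests, `sub` is assumed to compute (second entry) - (top entry), and the dummy value in mem[0] is arbitrary.

```python
# Stack program from this answer, as (label, opcode, operand) tuples.
PROGRAM = [
    (None, "pop", "a"),
    (None, "push", 0),
    (None, "zero", None),
    (None, "pop", "result"),
    (None, "push", 0),
    (None, "zero", None),
    (None, "pop", "i"),
    ("loop", "push", 0x8000),
    (None, "push", "i"),
    (None, "sub", None),
    (None, "bnez", "L1"),
    (None, "goto", "done"),
    ("L1", "push", "a"),
    (None, "push", "result"),
    (None, "add", None),
    (None, "pop", "result"),
    (None, "push", "i"),
    (None, "inc", None),
    (None, "pop", "i"),
    (None, "goto", "loop"),
]
ONE_BYTE = {"zero", "sub", "add", "inc"}  # assumed: 1 byte each, all others 3 bytes


def static_bytes():
    return sum(1 if op in ONE_BYTE else 3 for _, op, _ in PROGRAM)


def run(a, b):
    labels = {lab: idx for idx, (lab, _, _) in enumerate(PROGRAM) if lab}
    labels["done"] = len(PROGRAM)
    mem = {0: 12345, 0x8000: b}  # mem[0] is the dummy word; b lives at 0x8000
    stack = [a]                  # "pop a" implies a starts on the stack
    stats = {"instr_bytes": 0, "fetched": 0, "stored": 0, "max_depth": 1}
    pc = 0
    while pc < len(PROGRAM):
        _, op, arg = PROGRAM[pc]
        stats["instr_bytes"] += 1 if op in ONE_BYTE else 3
        pc += 1
        if op == "push":          # every push reads one 4-byte word
            stack.append(mem[arg])
            stats["fetched"] += 4
        elif op == "pop":         # every pop writes one 4-byte word
            mem[arg] = stack.pop()
            stats["stored"] += 4
        elif op == "zero":
            stack[-1] = 0
        elif op == "inc":
            stack[-1] += 1
        elif op in ("add", "sub"):
            top = stack.pop()
            stack[-1] = stack[-1] + top if op == "add" else stack[-1] - top
        elif op == "bnez":
            if stack.pop() != 0:
                pc = labels[arg]
        elif op == "goto":
            pc = labels[arg]
        stats["max_depth"] = max(stats["max_depth"], len(stack))
    return stats, mem["result"]


stats, result = run(a=7, b=10)
print(static_bytes(), stats, result)
```

With b = 10 the interpreter reports 50 static bytes, 330 instruction bytes fetched, 216 data bytes fetched, 92 data bytes stored, and a maximum stack depth of 2.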
If you could push and pop to/from a four-entry register file rather than memory (the Java virtual
machine does this), what would be the resulting number of bytes fetched and stored?
There are only four variables, so almost all memory accesses could be eliminated. If you stick to
a direct translation where you keep b in memory, then you would have to get it 11 times: 44 bytes
fetched, 0 bytes stored. If you keep b in a register, too, then you only have to get it once: 4 bytes
fetched, 0 bytes stored (but the code in 1.C’s answer doesn’t directly support this).
Problem 1.D Accumulator
       zero        accumulator   Zero the accumulator
       load B      subtractor    Load variable b into the subtractor
loop:  dec         subtractor    Decrement the subtractor
       add A       accumulator   Add the value of variable A into the accumulator
       bnez loop   subtractor    Branch if the subtractor is non zero
       goto done
done:
The above is just one way of doing it, using both the accumulator and the subtractor. It is also
possible to solve this problem using just one of the two, by storing and loading values from
memory at every loop iteration, similar to the stack architecture. Both solutions are fine.
How many bytes is your program?
17
Can the same program be implemented with just one accumulator (i.e., no subtractor)?
There are two ways to answer this. If we don’t load and store values to and from memory at
every iteration the answer is no. By the nature of this ISA, the moment we load one variable, say
B, we cannot store another variable which is necessary to do the increments.
However, similar to the stack architecture, we can make this work with one accumulator. It just
takes more loads and stores.
If not, how would you extend this ISA to implement this program with just one accumulator?
If the answer to the above was no, we can simply add a multiply instruction, or add memory-
memory instructions (instructions that can operate on two memory addresses). However, the
latter would change the nature of the ISA.
Problem 1.E Conclusions
CISC < RISC < STACK for both static and dynamic code size.
(RISC ≈ CISC) < STACK for data memory traffic
Problem 1.F Optimization
Most optimizations revolve around the elimination of unnecessary control flow. Also, the load
can be hoisted out of the loop.
       lui  x6, 0x08049     ;optional if x1 already contains b
       lw   x1, 0x580(x6)   ;optional if x1 already contains b
       xor  x4, x4, x4
       bge  x0, x1, done    ;skip the loop entirely if b <= 0
loop:  addi x1, x1, #-1
       add  x4, x4, x2
       bgtz x1, loop
done:
This re-write brings dynamic code size down to 136 bytes; static code size to 28; and memory
traffic down to 4 bytes.
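The sizes quoted for the re-written loop follow from a simple count. The sketch below assumes 4-byte instructions, four instructions ahead of the loop (lui, lw, xor, and the guard branch), three instructions per iteration, and b >= 1; the constant and function names are illustrative.

```python
INSTR_BYTES = 4   # every RISC-V instruction here is 32 bits
PRELUDE = 4       # lui, lw, xor, and the guard branch ahead of the loop
LOOP = 3          # decrement, add, branch back


def static_bytes():
    return INSTR_BYTES * (PRELUDE + LOOP)


def dynamic_bytes(b):
    # Valid for b >= 1: the prelude runs once, the loop body b times.
    return INSTR_BYTES * (PRELUDE + LOOP * b)


def data_bytes_fetched(word=4):
    # b is loaded exactly once, before the loop.
    return word


print(static_bytes(), dynamic_bytes(10), data_bytes_fetched())  # 28 136 4
```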
Problem 2: Microprogramming and Bus-Based Architectures
Problem 2.A Implementing Memory-to-Memory Add
Worksheet M1-1 shows one way to implement ADDm in microcode.
Note that to maintain “clean” behavior of your microcode, no registers in the register file should
change their value during execution (unless they are written to). This does not refer to
the registers in the datapath (IR, A, B, MA). Thus, using asterisks for the load signals (ldIR,
ldA, ldB, and ldMA) is acceptable as long as the correctness of your microcode is not affected
(and in fact, should be done for full optimality). Also note the ubr to FETCH0 must be contained
on its own line, since you can’t “spin” on the same micro-code line if memory is still busy OR
jump to FETCH0 if memory is not busy. S is either “spin on same micro-code line (upc)” or go
to upc+1.
When performing a memory access, you could be “spinning” for many cycles, waiting for
memory to become “not busy”. In that time, you must keep all inputs to the memory system
constant: thus, ldMA must be 0, because you don’t want the memory address to change while
accessing memory! Likewise, in “Mem <- A+B”, ldA and ldB must also be set to 0, so that the
data being sent to memory stays constant. To phrase this another way, we have no idea when
the memory system latches in our inputs.
Finally, note the cleverness of ldA being “0” on FETCH2. On entering an instruction, A always
equals PC+4. This saves a cycle if we dispatch to a jump instruction, which first loads PC+4
into the ALU (or the RDNPC instruction, which loads PC+4 into rd).
The microcode for ADDm is straightforward.
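The sequencing rule described above (a line marked S either spins or falls through, so the jump back to FETCH0 needs its own line) can be sketched as a next-state function. This is an illustrative model only: the "N" and "J" codes and the `target` parameter are assumed names for "go to upc+1" and "unconditional ubr"; only the S behaviour is taken from the text.

```python
def next_upc(upc, ubr, busy, target=None):
    """Return the next micro-PC for one micro-code line (illustrative model)."""
    if ubr == "N":   # assumed code: always fall through to the next line
        return upc + 1
    if ubr == "J":   # assumed code: unconditional micro-branch to `target`
        return target
    if ubr == "S":   # spin on this line while memory is busy, else upc + 1
        return upc if busy else upc + 1
    raise ValueError(f"unknown ubr type: {ubr}")


# A memory line can only spin or fall through...
print(next_upc(5, "S", busy=True), next_upc(5, "S", busy=False))  # 5 6
# ...so returning to FETCH0 (taken to be micro-PC 0 here) needs a separate line.
print(next_upc(6, "J", busy=False, target=0))  # 0
```

Because S has only two outcomes, "stay" and "upc+1", there is no encoding left on that line for "jump to FETCH0 when memory is done".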
Problem 2.B Implementing MOVN Instruction
Worksheet M1-2 shows one way to implement STRCPY in microcode.
A few notes:
- ldIR is zero for all uops because we keep needing to read the actual values of Rs and Rd, which
  are stored in the IR register.
- ldMA is kept at 0 when performing a memory operation because memory operations are
  multi-cycle and thus you need to hold the memory address constant (this logic also applies to
  ldA and ldB when used as sources for memory).
Problem 2.C Instruction Execution Times
How many cycles does it take to execute the following instructions in the microcoded RISC-V
machine? Use the states and control points from RISC-V-Controller-2 in Lecture 2 (or Lab 1, in
${LAB1ROOT}/src/rv32_ucode/micrcode.scala) and assume Memory will not assert its busy
signal.
Instruction                          Cycles
ADD x3,x2,x1                         3 + 3 = 6
ADDI x2,x1,#4                        3 + 3 = 6
SW x1,0(x2)                          3 + 5 = 8
BNE x1,x2,label #(x1 == x2)          3 + 4 = 7
BNE x1,x2,label #(x1 != x2)          3 + 3 + 4 = 10
BEQ x1,x2,label #(x1 == x2)          3 + 3 + 4 = 10
BEQ x1,x2,label #(x1 != x2)          3 + 4 = 7
J label                              3 + 4 = 7
JAL label                            3 + 5 = 8
JALR x1                              3 + 5 = 8
As discussed in Lecture 2, instruction execution includes the number of cycles needed to fetch
the instruction. The lecture notes used 4 cycles for the fetch phase, while Worksheet 1 shows
that this phase can actually be implemented in 3 cycles; either answer was fine. The above
table uses 3 cycles for the fetch phase.
The above answers are derived from the micro-coded processor provided in Lab 1. It is okay if
your answers differ from having been derived from the lecture notes.
Overall, BNE (for a taken branch), and BEQ (for a taken branch) take the most cycles to execute
(10), while arithmetic functions such as ADD and ADDI take the fewest cycles (6).
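Since every entry is a fetch cost plus an execute cost, the table can be regenerated for either fetch length. The sketch below takes the execute-phase counts from the table (with the fetch cycles removed); the dictionary keys and function name are illustrative.

```python
# Execute-phase cycles, i.e. the table entries without the fetch cycles.
EXECUTE_CYCLES = {
    "ADD": 3,
    "ADDI": 3,
    "SW": 5,
    "BNE not taken": 4,
    "BNE taken": 7,
    "BEQ taken": 7,
    "BEQ not taken": 4,
    "J": 4,
    "JAL": 5,
    "JALR": 5,
}


def total_cycles(instr, fetch_cycles=3):
    """Fetch cycles plus execute cycles, assuming memory never asserts busy."""
    return fetch_cycles + EXECUTE_CYCLES[instr]


for name in EXECUTE_CYCLES:
    # Second column: 3-cycle fetch (Lab 1); third column: 4-cycle fetch (lecture notes).
    print(f"{name:15s} {total_cycles(name):3d} {total_cycles(name, fetch_cycles=4):3d}")
```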
Problem 3: 6-Stage Pipeline
Problem 3.A Hazards: Second Write Port
The second write port improves performance by resolving some RAW hazards earlier than they
would be if ALU operations had to wait until writeback to provide their results to subsequent
dependent instructions. It would help with the following instruction sequence:
add x1, x2, x3
add x4, x5, x6
add x7, x1, x9
The important insight is that the second write port cannot resolve data hazards for immediately
back-to-back instructions. (Recall that the RF is read in the ID stage. Once the first instruction
has written back at the end of EX, it is in M1, so it is the third instruction that is in ID and can
read the new value.)
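The stage occupancy behind this argument can be tabulated directly. This is a minimal sketch assuming one instruction issues per cycle with no stalls; the stage list uses the names from this problem, with "IF" as an assumed name for the first of the six stages.

```python
STAGES = ["IF", "ID", "EX", "M1", "M2", "WB"]  # "IF" is an assumed stage name


def stage_at(instr_index, cycle):
    """Stage occupied by instruction `instr_index` at `cycle`, with no stalls."""
    pos = cycle - instr_index
    return STAGES[pos] if 0 <= pos < len(STAGES) else None


# The first add writes the register file at the end of its EX cycle.
write_cycle = STAGES.index("EX")  # cycle 2 for instruction 0
for cycle in (write_cycle, write_cycle + 1):
    print(cycle, [stage_at(i, cycle) for i in range(3)])
# 2 ['EX', 'ID', 'IF']
# 3 ['M1', 'EX', 'ID']
```

In the cycle where the first add is still in EX, the second instruction is already reading the register file in ID, so only the third instruction's ID cycle comes after the early write.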
Problem 3.B Hazards: Bypasses Removed
The bypass path from the end of M1 to the end of ID can be removed. (Credit was also given for
the bypass path from the beginning of M2 to the beginning of EX, since these are equivalent.)
Additionally, ALU results no longer have to be bypassed from the end of M2 or the end of WB,
but these bypass paths are still used to forward load results to earlier stages.
Problem 3.C Precise Exceptions
Illegal address exceptions are not detected until the start of the M2 stage. Since writebacks can
occur at the end of the EX stage, it is possible for an ALU op following a memory access to an
illegal address to have written its value back before the exception is detected, resulting in an
imprecise exception. For example:
lw x1, -1(x0)     // address -1 is misaligned
add x2, x3, x4    // x2 will be overwritten, even though preceding instruction has faulted
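The ordering problem in this two-instruction example comes down to two cycle numbers. The sketch below assumes the instructions issue back to back with no stalls and that the six stages are named IF, ID, EX, M1, M2, WB ("IF" is an assumed name); the helper name is illustrative.

```python
STAGES = ["IF", "ID", "EX", "M1", "M2", "WB"]  # "IF" is an assumed stage name


def cycle_in_stage(instr_index, stage):
    """Cycle in which instruction `instr_index` occupies `stage`, with no stalls."""
    return instr_index + STAGES.index(stage)


fault_detected = cycle_in_stage(0, "M2")  # lw: fault seen at the start of its M2 cycle
early_write = cycle_in_stage(1, "EX")     # add: x2 written at the end of its EX cycle

print(early_write, fault_detected)  # 3 4
# The add's write completes in cycle 3, before the fault is seen in cycle 4.
print(early_write < fault_detected)  # True
```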
Problem 3.D Precise Exceptions: Implemented using an Interlock
Stall any ALU op in the ID stage if the instruction in the EX stage is a load or a store. The
instruction sequence above engages this interlock.
Loads and stores account for about one-third of dynamic instructions. Assuming that the
instruction following a load or store is an ALU op two-thirds of the time, and ignoring
the existing load-use delay, this solution will increase the CPI by (1/3)*(2/3) = 2/9. However,
only a qualitative explanation was necessary for credit.
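The CPI estimate is a product of two fractions and a stall length. The sketch below uses exact fractions; the one-cycle stall per interlock is an assumption implied by the calculation above, and the variable names are illustrative.

```python
from fractions import Fraction

load_store_fraction = Fraction(1, 3)  # loads and stores among dynamic instructions
alu_follows = Fraction(2, 3)          # chance the next instruction is an ALU op
stall_cycles = 1                      # assumed: one bubble each time the interlock fires

cpi_increase = load_store_fraction * alu_follows * stall_cycles
print(cpi_increase, float(cpi_increase))  # 2/9, about 0.22
```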
Problem 3.E Precise Exceptions: Implemented using an Extra Read Port
In addition to reading an instruction’s source operands in the ID stage, also read the destination
register, rd. If an early writeback occurs before a preceding exception was detected, then the old
value of rd is preserved in the EX/M1 pipeline register and can be restored to the register file,
maintaining precise state.
Problem 4: Branch Speculation
Problem 4.A Motivating Branch Speculation
The point in time that a branch comparison occurs is circled above. The second circle (OR) is
when the decode stage recognizes a dependent load in the EXE stage (at t5) and stalls.
Problem 4.B Motivating Branch Speculation (2)
The IF_KILL and DEC_KILL signals go out in t5, when the “mispredict” is discovered.
Bubbles are inserted into the pipeline, and show up on t6.
Problem 4.C Adding a BHT
Loop is not taken. The BHT is “strongly” taken, so the BHT predicts “taken” when we see BEQ.
The BHT is in Decode, and the Fetch stage always predicts PC+4, so we eat 1 cycle when the BHT
predicts a taken branch, and we eat another cycle if the BHT predicts taken but the branch is
actually not taken (i.e., it just degenerates to the original 2-cycle branch penalty). At t4 (as
circled), the Decode stage kills the Fetch stage to redirect it down the “taken” path. However, at
t5 we resolve the branch comparison in Execute and must correct for the BHT’s misprediction,
and kill Fetch and Decode.
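The penalties described above can be summarized as a small function. This is a sketch, not the pipeline's control logic: the two "predicts taken" cases come from the text, while the two "predicts not taken" cases are assumptions (a correctly predicted fall-through costs nothing, and a taken branch pays the original 2-cycle penalty).

```python
def branch_penalty(bht_predicts_taken, actually_taken):
    """Cycles lost on one branch with the BHT in Decode and resolution in Execute."""
    if bht_predicts_taken:
        # Decode kills the PC+4 fetch (1 cycle); a wrong guess costs one more.
        return 1 if actually_taken else 2
    # Assumed: the fetch stage's PC+4 guess stands until Execute resolves the branch.
    return 2 if actually_taken else 0


# The case in this problem: strongly-taken BHT entry, branch actually not taken.
print(branch_penalty(bht_predicts_taken=True, actually_taken=False))  # 2
```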
Problem 4.D Adding a BTB
BTB mispredicts the exit, and it takes two cycles for branch logic in Exe to catch the mistake.
The first circle is drawn to show when the BTB had a hit and predicted “taken”. The second
circle in t3 shows when the branch comparison catches a mispredict and kills two cycles.
Problem 5: CISC vs RISC
For each of the following questions, circle either CISC or RISC, depending on which ISA you
feel would be best suited for the situation described. Also, briefly explain your reasoning.
Problem 5.A Lack of Good Compilers I
CISC
CISC ISAs provided more complex, higher-level instructions such as string manipulation
instructions and special addressing modes convenient for indexing tables (say for your
company’s payroll application). Two example CISC instructions: “DBcc: Test Condition,
Decrement, and Branch” and “CMP2: Compare Register against Upper and Lower Bounds”.
This made life easy if you stared at assembly all day, and couldn’t hide behind
convenient software abstractions/subroutines!
Problem 5.B Lack of Good Compilers II
Compilers had difficulty targeting CISC ISAs in part because the complicated instructions have
many hard-to-analyze side-effects. A load-store/register-register RISC ISA which limits
side-effects to a single register or memory location per instruction is relatively easy
for a compiler to understand, analyze, and schedule for.
RISC
Problem 5.C Fast Logic, Slow Memory
CISC
When instruction fetch takes 10x longer than a CPU logic operation, you are going to want to
push as much compute as you can into each instruction! For example, a CISC instruction which
performs expensive, multi-cycle floating point routines in hardware is FAR faster than a
software floating point subroutine that requires perhaps dozens of expensive instruction fetches.
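A back-of-the-envelope model makes the argument concrete. Only the 10x fetch-to-logic ratio comes from the text; the instruction and operation counts below are hypothetical numbers picked for illustration, and the function name is made up.

```python
FETCH_COST = 10  # an instruction fetch costs 10 logic-operation times (from the text)
LOGIC_COST = 1


def routine_time(instructions, logic_ops):
    """Time in logic-operation units: one fetch per instruction plus the compute."""
    return instructions * FETCH_COST + logic_ops * LOGIC_COST


# Hypothetical floating-point routine needing 36 logic operations in total.
software = routine_time(instructions=36, logic_ops=36)  # one simple op per instruction
hardware = routine_time(instructions=1, logic_ops=36)   # one multi-cycle CISC instruction

print(software, hardware)  # 396 46
```

Under these assumed numbers the compute is identical and the whole difference is fetch time, which is why dense instructions win when memory is slow.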
Problem 5.D Higher Performance(?)
Because RISC instructions tend to have simple, easy to analyze side-effects, they lend
themselves more readily to pipelined micro-architectures which dynamically check for
dependencies between instructions and interlock or bypass when dependencies arise. And
because little work needs to be performed in each stage, the pipeline can be clocked at very high
frequencies.
This advantage is evident in modern micro-architectures of old CISC ISAs: typically the front-
end of the processor has a decoder which translates CISC instructions (e.g., x86 instructions)
into RISC “micro-ops”, which a high-performance pipeline can then dynamically schedule for
maximum performance.
CISC architectures such as x86 and IBM S/360 are still around for legacy reasons.
But if you had a chance at a clean slate, you’d probably prefer a clean RISC implementation
with a direct translation to the micro-architecture instead of using area and power on a CISC
decoder front-end (not to mention the additional complexity forced on your memory system to
handle the odd CISC addressing modes).
RISC
Problem 6: Iron Law of Processor Performance
a) Adding a branch delay slot
   Instructions / Program: Increase: NOPs must be inserted when the branch delay slot cannot be
      usefully filled.
   Cycles / Instruction: Decrease: some control hazards are eliminated; also, additional NOPs
      execute quickly because they have no data hazards.
   Seconds / Cycle: No effect: doesn’t change pipeline. Decrease: branch_kill signal is no
      longer needed.
   Overall Performance: Ambiguous: depends on the program and how often the delay slot can be
      filled with useful work.

b) Adding a complex instruction
   Instructions / Program: Decrease: if the added instruction can replace a sequence of
      instructions. No effect: if it is unusable.
   Cycles / Instruction: Increase: if implementing the instruction means adding or re-using
      stages. No effect: if the number of cycles is kept constant but it just lengthens the
      logic in one stage.
   Seconds / Cycle: Increase: since more logic and thus longer critical path. No effect: if it
      is implemented by more or re-used stages but each stage gets no longer.
   Overall Performance: Ambiguous: if the program can take advantage of the new instruction, it
      can mitigate the costs of implementing it. This is a hard decision for an ISA designer
      to make!

c) Reduce number of registers in the ISA
   Instructions / Program: Increase: values will more frequently be spilled to the stack,
      increasing number of loads and stores.
   Cycles / Instruction: Increase: more loads followed by dependent instructions will cause
      stalls, and likely be difficult to schedule around.
   Seconds / Cycle: Decrease: fewer registers means shorter register file access time.
   Overall Performance: Ambiguous: if the program uses few registers and thus spills rarely to
      memory, the faster reg. access times may win out. Also, your instructions may be able to
      be shorter, improving amongst other things code density and I$ hit-rates.

d) Improving memory access speed
   Instructions / Program: No effect: since instructions make no assumption about memory speed.
   Cycles / Instruction: Decrease: if access to memory is pipelined (>1 cycle), since it will
      now take less cycles. No effect: if memory access is done in a single cycle.
   Seconds / Cycle: Decrease: if memory access is on the critical path or memory was 1 cycle.
      No effect: if memory is pipelined and just takes less cycles.
   Overall Performance: Improve: improving memory access time, at least by these Iron Law
      metrics, will increase performance of the whole system (unless you chose “no effect” for
      everything). Of course, there could be other secondary costs of improving mem. access
      speeds, like having to use smaller caches, but I’m getting carried away here.

e) Adding 16-bit versions of the most common instructions in MIPS (normally 32-bits in length)
   to the ISA (i.e., make MIPS a variable length ISA)
   Instructions / Program: No effect: because you are replacing 32b instructions with
      equivalent 16b versions, it saves on code space, but it leaves the Inst/Program count
      unchanged.
   Cycles / Instruction: No effect: you are executing simple equivalent 16b versions of regular
      32b instructions. Both appear identical to the pipeline. Decrease: since code size has
      shrunk, I$ hits will increase and thus less cycles will be spent fetching instructions.
   Seconds / Cycle: Increase: decode may increase this since the instruction format is more
      complex (and you have to deal with figuring out where the instruction boundaries are).
      No effect: if this fits within the cycle time, since this makes no change to the pipeline
      and only increases the decode stage (or perhaps adds another stage to the front-end).
   Overall Performance: Ambiguous: the main advantage is smaller code size, which can improve
      I$ hit rates and save on fetch energy (get more instructions per fetch). This can improve
      performance (or at least energy), however the more complex decode could also counteract
      these gains.

f) For a given CISC ISA, changing the implementation of the micro-architecture from a
   microcoded engine to a RISC pipeline (with a CISC-to-RISC decoder on the front-end)
   Instructions / Program: No effect: because the ISA is not changing, the binary does not
      change, and thus there is no change to Inst/Program.
   Cycles / Instruction: Decrease: microcoded machines take several clock cycles to execute an
      instruction, while the RISC pipeline should have a CPI near 1 (thanks to pipelining).
   Seconds / Cycle: No effect: the amount of work done in one pipeline stage and one microcode
      cycle are about the same. Increase: the RISC pipeline introduces longer control paths and
      adds bypasses, which are likely to be on the critical path.
   Overall Performance: Increase: it should be far easier to pipeline RISC uops once the CISC
      instructions have been decoded/translated, leading to a higher performance machine (see
      modern x86 machines).
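The “Ambiguous” entries can be made concrete by multiplying the three Iron Law terms. The sketch below does this for the branch delay slot case; every number in it is hypothetical and chosen only to show that the overall result can go either way, and the function name is illustrative.

```python
def execution_time(instructions, cpi, cycle_time_ns):
    """Iron Law: time = (instructions/program) * (cycles/instruction) * (seconds/cycle)."""
    return instructions * cpi * cycle_time_ns


# Hypothetical baseline without a branch delay slot.
baseline = execution_time(1_000_000, 1.5, 1.0)

# Hypothetical delay-slot machines: CPI drops in both, but the instruction
# count grows with the number of slots that had to be filled with NOPs.
well_filled = execution_time(1_020_000, 1.4, 1.0)
poorly_filled = execution_time(1_150_000, 1.4, 1.0)

print(well_filled < baseline)    # True: faster when most slots hold useful work
print(poorly_filled > baseline)  # True: slower when most slots hold NOPs
```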