CS Note
As we know, cache memory bridges the speed mismatch between the main memory and the
processor. Whenever a cache hit occurs,
The required word is present in the cache memory.
The required word is then delivered from the cache memory to the CPU.
And whenever a cache miss occurs,
The required word is not present in the cache memory.
The block containing the required word must be mapped in from the main memory.
Such mapping can be performed using various cache mapping techniques.
Let us discuss the different techniques of cache mapping in this article.
Process of Cache Mapping
The process of cache mapping defines how a certain block that is present in the main memory
gets mapped to a line of the cache memory in the case of a cache miss.
In simpler words, cache mapping refers to the technique by which the contents of the main memory are
brought into the cache memory.
Now, before we proceed, it is crucial that we note these points:
Important Note:
The main memory gets divided into multiple equal-size partitions, known as frames or
blocks.
The cache memory is divided into partitions of the same size as the blocks, known as lines.
During cache mapping, the main memory block is simply copied into the cache; the block is
not removed from the main memory.
Techniques of Cache Mapping
One can perform the process of cache mapping using the following three techniques:
1. K-way Set Associative Mapping
2. Direct Mapping
3. Fully Associative Mapping
1. Direct Mapping
In the case of direct mapping, a particular block of the main memory can map only to a particular
line of the cache. The cache line number to which a given block can map is given by the following:
Cache line number = (Main memory block address) Modulo (Total number of lines in the cache)
For example,
Consider a cache memory that is divided into a total of ‘n’ lines.
Then, block ‘j’ of the main memory can map only to line (j mod n) of the cache.
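The mapping rule above can be sketched in a few lines of Python; the block numbers and the line count n = 8 below are assumptions chosen purely for illustration.

```python
def direct_mapped_line(block_number, num_lines):
    """Direct mapping: a block can live only in line (block mod num_lines)."""
    return block_number % num_lines

# Assumed geometry: a cache with n = 8 lines.
n = 8
for j in (0, 5, 8, 13):
    print(f"main-memory block {j} -> cache line {direct_mapped_line(j, n)}")
```

Note that blocks 5 and 13 collide on line 5, which is exactly why direct mapping needs no replacement policy but can suffer conflict misses.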
To achieve supercomputer speeds, the 360/9x models pioneered new concepts such as instruction
pipelining and lookahead, branch prediction, cache memory, overlap, and parallelism.
But the key thing to remember here is that pipelining as a tool is still ultimately in service
of parallelism. Parallelism is the major feature of the microprocessors developed in the 21st century.
Which two elements are required to implement R-Format arithmetic logic unit (ALU) operations?
ALU and register file.
The ALU actually does the operation, the register file specifies which registers hold the data to be
operated on and the register to write the result to.
R-Format Instruction: Reads two registers, performs an ALU operation on the contents of the registers,
and writes the result to a register. Also called “R-Type Instructions” or “Arithmetic-Logical Instructions”.
Examples include ADD, SUB, AND, and ORR.
Example:
ADD X1, X2, X3
The value in register X2 and the value in register X3 are added by the ALU and the sum is
written into register X1.
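A toy sketch of what the hardware does for this R-format ADD: the register file supplies the two source registers and receives the result, while the ALU performs the addition. The register contents below are made-up values for illustration.

```python
# Minimal register file: names -> contents (the values are assumptions).
registers = {"X1": 0, "X2": 7, "X3": 5}

def r_format_add(rd, rn, rm):
    """ADD rd, rn, rm: read two registers, add on the 'ALU', write rd back."""
    registers[rd] = registers[rn] + registers[rm]

r_format_add("X1", "X2", "X3")  # ADD X1, X2, X3
print(registers["X1"])          # 7 + 5 = 12
```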
Register File — A state element that consists of a set of registers that can be read and written by
supplying a register number to be accessed.
ORR — Logical OR.
Syntax
ORR{S}{cond} Rd, Rn, Operand2
where:
S
is an optional suffix. If S is specified, the condition flags are updated on the result of the operation.
cond
is an optional condition code.
Rd
is the destination register.
Rn
is the register holding the first operand.
Operand2
is a flexible second operand.
Operation
The ORR instruction performs bitwise OR operations on the values in Rn and Operand2.
In certain circumstances, the assembler can substitute ORN for ORR, or ORR for ORN. Be aware of this
when reading disassembly listings.
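Since ORR is just a bitwise OR of Rn with Operand2, its effect is easy to check in Python; the operand values below are arbitrary.

```python
rn = 0b1100         # first operand (Rn), an assumed value
operand2 = 0b1010   # flexible second operand, also assumed
rd = rn | operand2  # ORR Rd, Rn, Operand2: bitwise OR
print(bin(rd))      # 0b1110 — each result bit is set if set in either operand
```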
10. What is an ALU?
Hardware that performs addition, subtraction, and usually logical operations such as AND and OR
What is a Datapath?
The component of the processor that performs arithmetic operations.
What is in a datapath?
[31] DataPath Wikipedia.
A datapath is a collection of functional units such as arithmetic logic units or multipliers that
perform data processing operations, registers, and buses. … A data path is the ALU, the set of
registers, and the CPU’s internal bus(es) that allow data to flow between them.
12. Which component of a computer moderates the action of its other components?
Control — the component of the processor that commands the datapath, memory, and I/O devices
according to the instructions of the program.
The control commands the datapath.
In computer architecture, the control unit is defined as an important component of the central processing
unit (CPU) that controls and directs all the operations of the computer system. The microprocessor
is considered to be the brain of the computer system. [32]
The processor internally consists of three functional units. The CPU functional units include the control unit
(CU), the arithmetic and logic unit (ALU), and the memory unit (MU). [32]
13. Binary Addition
Given the following 8-bit integer binary variables:
X1 = 11000110
X2 = 11110111
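The notes do not show the worked answer for this question, so here is a sketch that simply performs the 8-bit addition of X1 and X2 and reports the truncated result and the carry-out.

```python
x1 = 0b11000110  # 198
x2 = 0b11110111  # 247
total = x1 + x2             # 445, which needs 9 bits
result = total & 0xFF       # low 8 bits kept by an 8-bit register
carry_out = total >> 8      # 1: the addition overflows 8 bits
print(f"sum = {result:08b}, carry out = {carry_out}")
```

So the 8-bit sum is 10111101 with a carry out of 1 (i.e., unsigned overflow).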
18. What is meant by pipelining in computer architecture?
Pipelining is a technique in which multiple instructions are overlapped in execution. It is a strategy
for exploiting instruction-level parallelism within the processor.
18. What is temporal locality?
The principle stating that if a data location is referenced then it will tend to be referenced again soon.
19. What is spatial locality?
The locality principle stating that if a data location is referenced, data locations with nearby addresses
will tend to be referenced soon.
20. What’s the principle of locality?
The principle of locality states that programs access a relatively small portion of their address space at
any instant of time.
21. What’s miss penalty?
The time required to fetch a block into a level of the memory hierarchy from the lower level, including
the time to access the block, transmit it from the one level to the other, insert it in the level that
experienced the miss, and then pass the block to the requestor.
Memory Hierarchy [3]
22. What does each bank of modern DRAMS consist of?
Rows
DRAM — Memory built as an integrated circuit; it provides random access to any location. Access times
are about 50 nanoseconds, and cost per gigabyte in 2012 was $5-$10. It is denser (has more bits per unit area)
than SRAM, but not as fast. It is volatile memory that needs power to retain data, and it clears after every
reboot or loss of power. Main memory, the memory used to hold programs while they are running,
typically consists of DRAM in today's computers.
The value kept in a cell in Dynamic RAM is stored as a charge in a capacitor. A single transistor is then
used to access this stored charge, either to read the value or to overwrite the charge stored there. Because
DRAMs use only one transistor per bit of storage, they are much denser and cheaper per bit than SRAM.
As DRAMs store the charge on a capacitor, it cannot be kept indefinitely and must periodically be
refreshed. That is why this memory structure is called dynamic.
To refresh we just read the cell contents and then write it back. The charge keeps for a few milliseconds,
so that’s how long we have to write it back before we lose the data. It would take too long to do this for
each cell so DRAMs use a two-level decoding structure that lets us refresh an entire Row (a whole word
line) at a time.
The row organization that helps with refresh also helps with performance. To improve performance,
DRAMs buffer rows for repeated access. The buffer acts like an SRAM; by changing the address,
random bits can be accessed in the buffer until the next row access. This capability improves the access
time significantly, since the access time to bits in the row is much lower. Making the chip wider also
improves the memory bandwidth of the chip. When the row is in the buffer, it can be transferred by
successive addresses at whatever the width of the DRAM is (typically 4, 8, or 16 bits), or by specifying a
block transfer and the starting address within the buffer.
23. Is SRAM or DRAM used to implement memory levels closest to the processor?
SRAM, because SRAM is faster and so is used to implement the memory levels closest to the
processor. SRAM access times typically range from 0.5 to 2.5 ns, while DRAM access times
typically range from 50 to 70 ns.
24. Which has fewer transistors per bit of memory, SRAM or DRAM?
DRAM — that is why it is cheaper (transistors cost money), not as close to the CPU, and slower.
The architectural difference between the two is that DRAM uses transistors and capacitors in an array of
repeating circuits (where each circuit is one bit), whereas SRAM uses several transistors in a circuit to
form one bit. [19]
25. Which type of RAM requires a periodic refresh, SRAM or DRAM?
DRAM — hence the word dynamic
26. What is superscalar as it relates to parallelization?
Denoting a computer architecture where several instructions are loaded at once and, as far as possible, are
executed simultaneously, shortening the time taken to run the whole program.
Superscalar describes a microprocessor design that makes it possible for more than one instruction at a
time to be executed during a single clock cycle.
27. Example ARM Instruction Problem:
The value of b is stored in r1, c is stored in r2, and a is stored in r0.
Which set of ARM instructions will accomplish a = b & c?
→ AND r0, r1, r2
OR r0, r1, r2
EOR r0, r1, r2
ORR r0, r1, r2
28. Example ARM Instruction Problem:
X1: A
X2: B
X3: C
39. Which register will be populated with the reason for an exception in LEGv8 architecture?
ESR — Exception Syndrome Register
The ESR register holds a field that indicates the reason for the exception.
40. What is a page table?
A page table is the data structure used by a virtual memory system in a computer operating system to
store the mapping between virtual addresses and physical addresses. Virtual addresses are used by the
program executed by the accessing process, while physical addresses are used by the hardware, or more
specifically, by the RAM subsystem. The page table provides the virtual-address translation that is
necessary to access data in memory.
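A page table lookup can be sketched as a simple mapping from virtual page numbers to physical frame numbers. The page size and the table entries below are assumptions for illustration only.

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages

# Page table for one process: virtual page number -> physical frame number.
page_table = {0: 5, 1: 2, 2: 7}

def translate(virtual_address):
    """Split the address into (page, offset) and rebuild it as (frame, offset)."""
    page, offset = divmod(virtual_address, PAGE_SIZE)
    frame = page_table[page]  # a missing entry here would mean a page fault
    return frame * PAGE_SIZE + offset

print(translate(4100))  # page 1, offset 4 -> frame 2 -> 8196
```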
41. What is a page table register?
A page table base register (PTBR) holds the base address for the page table of the current process. It is a
processor register managed by the operating system: to support virtual memory, the processor must
provide a page table base register that is accessible to the operating system.
[24] Virtual Memory Diagram https://www.usna.edu/Users/cs/crabbe/SI411/current/memory/vm.html
42. Which statement about the operating system describes how virtual memory is allocated in
ARM architecture?
→ It loads the page table register to refer to the page table of the process.
It uses a reference bit to refer to the page table of the process
It loads the entire page table to reference the process
It uses a limit register to refer to the page table of the process.
[25] Paging Systems. https://people.cs.rutgers.edu/~pxk/416/notes/10-paging.html
43. What is a virtual address space?
In computing, a virtual address space (VAS) is the set of ranges of virtual addresses that an operating
system makes available to a process. The range of virtual addresses usually starts at a low address and
can extend to the highest address allowed by the computer's instruction set architecture.
Virtual addresses need to be translated to physical addresses. Virtual addresses can reference anywhere
in memory, not just in “virtual memory”.
44. What is a translation-lookaside buffer?
A translation lookaside buffer (TLB) is a memory cache that is used to reduce the time taken to access a
user memory location. It is a part of the chip's memory management unit (MMU).
The TLB stores recent translations of virtual memory to physical memory and can be called an address-
translation cache. A TLB may reside between the CPU and the CPU cache, between CPU cache and the
main memory, or between the different levels of the multi-level cache. The majority of desktop, laptop
and server processors include one or more TLBs in the memory-management hardware, and it is nearly
always present in any processor that utilizes paged or segmented virtual memory.
44. What is used by virtual memory to increase performance?
→ Translation-lookaside buffer
Sparse Memory
Demand Paging
Page Size
This buffer caches recently translated addresses so that they need not be translated over and over again.
By the principle of temporal locality, we can infer that a recently accessed address is likely to be
accessed again in the near future.
45. What maps virtual memory to real memory by using page tables?
→ Each guest operating system manages virtual memory independently
The host operating system manages each virtual machine's virtual memory
The hypervisor software manages each virtual machine’s virtual memory
Guest operating systems prohibit the use of virtual memory
What is Amdahl’s Law?
In 1967, Amdahl pointed out that the speedup is limited by the fraction of the serial part of the software
that is not amenable to parallelization [1]. Amdahl’s law can be formulated as follows
speedup = 1 / (s + p / N)
where s is the proportion of execution time spent on the serial part, p is the proportion of execution time
spent on the part that can be parallelized, and N is the number of processors.
Amdahl’s law states that, for a fixed problem, the upper limit of speedup is determined by the serial
fraction of the code.
This is called strong scaling and can be explained by the following example.
Consider a program that takes 20 hours to run using a single processor core. If a particular part of the
program, which takes one hour to execute, cannot be parallelized (s = 1/20 = 0.05), and if the code that
takes up the remaining 19 hours of execution time can be parallelized (p = 1 − s = 0.95), then regardless
of how many processors are devoted to a parallelized execution of this program, the minimum execution
time cannot be less than that critical one hour. Hence, the theoretical speedup is limited to at most 20
times (when N = ∞, speedup = 1/s = 20). As such, the parallelization efficiency decreases as the amount
of resources increases. For this reason, parallel computing with many processors is useful only for highly
parallelized programs.
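The worked example above (s = 0.05) can be reproduced directly from the formula speedup = 1 / (s + p / N):

```python
def amdahl_speedup(s, n):
    """Amdahl's law with serial fraction s and parallel fraction p = 1 - s."""
    return 1.0 / (s + (1.0 - s) / n)

s = 0.05  # the one serial hour out of twenty
for n in (1, 4, 16, 1024):
    print(f"N = {n:4d}: speedup = {amdahl_speedup(s, n):.2f}")
print(f"upper limit (N -> infinity): {1 / s:.0f}x")
```

Even with 1024 processors the speedup (about 19.6x) is already pressing against the 20x ceiling set by the serial 5%.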
Weak scaling is not bound by Amdahl's law; it is instead bound by Gustafson's law. [27]
[27] Amdahl's and Gustafson's
Laws. https://www.cse-lab.ethz.ch/wp-content/uploads/2018/11/amdahl_gustafson.pdf
The illustration shows the differences between a NUMA and UMA architecture. The NUMA layout has
local memory assigned to each processor, while in the UMA setup multiple processors share memory
via a bus. (Image via Database Technology Group) [4]
Non-uniform memory access is more expensive to implement and more complicated, BUT it is easier to
scale, which is pretty nice.
53. In which type of multiprocessor is latency to any word in main memory the same regardless of
which processor requests access?
→ Uniform memory access
Locking memory access
Non-uniform memory access
Synchronization memory access
The figure below shows that a memory hierarchy uses smaller and faster memory technologies close to
the processor. Thus, accesses that hit in the highest level of the hierarchy can be processed quickly.
Accesses that miss go to lower levels of the hierarchy, which are larger but slower. If the hit rate is high
enough, the memory hierarchy has an effective access time close to that of the highest (and fastest) level
and a size equal to that of the lowest (and largest) level.
In most systems, the memory is a true hierarchy, meaning that data cannot be present in level i unless
they are also present in level i + 1. [3]
Calculate the fastest processor
Four processors (1, 2, 3, and 4) have clock frequencies of 200 MHz, 300 MHz, 500 MHz, and 700
MHz, respectively.
Suppose:
Processor 1 can execute an instruction with an average of 5 steps.
Processor 2 can execute an instruction with an average of 3 steps.
Processor 3 can execute an instruction with an average of 3 steps.
Processor 4 can execute an instruction with an average of 5 steps.
Which processor should be selected to improve performance for the execution of the same instruction?
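Since each instruction takes (average steps) / (clock frequency), the comparison reduces to a little arithmetic; this sketch ranks the four processors from the question.

```python
# Processor -> (clock in MHz, average steps per instruction), from the question.
processors = {1: (200, 5), 2: (300, 3), 3: (500, 3), 4: (700, 5)}

# Time per instruction in microseconds: steps / MHz.
times = {p: steps / mhz for p, (mhz, steps) in processors.items()}
for p, t in sorted(times.items(), key=lambda kv: kv[1]):
    print(f"Processor {p}: {t * 1000:.2f} ns per instruction")
```

Processor 3 wins at 3 / 500 MHz = 6 ns per instruction, beating processor 4's roughly 7.14 ns despite the lower clock.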
Which set-associative cache will improve overall performance?
Answer: Two-way Cache
Fully associative cache: A cache structure in which a block can be placed in any location in the cache.
Set-associative cache: A cache that has a fixed number of locations (at least two) where each block can
be placed.
The middle range of designs between direct mapped and fully associative is called set associative. In a
set-associative cache, there are a fixed number of locations where each block can be placed. A set-
associative cache with n locations for a block is called an n-way set-associative cache. An n-way set-
associative cache consists of a number of sets, each of which consists of n blocks. Each block in the
memory maps to a unique set in the cache given by the index field, and a block can be placed in any
element of that set. Thus, a set-associative placement combines direct-mapped placement and fully
associative placement: a block is directly mapped into a set, and then all the blocks in the set are searched
for a match.
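The placement rule in the paragraph above (a block maps to exactly one set, then any way within it) can be sketched as follows; the cache geometry is an assumption for illustration.

```python
def set_index(block_number, num_sets):
    """Set-associative placement: the block's set is block mod num_sets."""
    return block_number % num_sets

# Assumed geometry: 8 blocks of capacity organised two-way -> 4 sets.
num_sets = 4
for block in (0, 4, 5, 13):
    s = set_index(block, num_sets)
    print(f"block {block} -> set {s} (either of that set's 2 ways)")
```

Blocks 5 and 13 both land in set 1, but unlike direct mapping they can coexist there, one per way.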
73. Which technique should be implemented to reduce cache miss rate?
Answer: blocking
The easiest way to reduce the miss rate is to increase the cache block size, but only up to a certain point
relative to the cache size.
Note that the miss rate actually goes up if the block size is too large relative to the cache size. Each line
represents a cache of different size. (This figure is independent of associativity, discussed soon.)
Unfortunately, SPEC CPU2000 traces would take too long if block size were included, so these data are
based on SPEC92. [5]
74. What happens on an instruction cache miss?
1. Send the original PC value to the memory.
2. Instruct main memory to perform a read and wait for the memory to complete its access.
3. Write the cache entry, putting the data from memory in the data portion of the entry, writing the
upper bits of the address (from the ALU) into the tag field, and turning the valid bit on.
4. Restart the instruction execution at the first step, which will refetch the instruction, this time finding
it in the cache.
Examples-
ADD 10 will increment the value stored in the accumulator by 10.
MOV R #20 initializes register R to a constant value 20.
4. Direct Addressing Mode-
In this addressing mode,
The address field of the instruction contains the effective address of the operand.
Only one reference to memory is required to fetch the operand.
It is also called absolute addressing mode.
Example-
ADD X will increment the value stored in the accumulator by the value stored at memory location X.
AC ← AC + [X]
5. Indirect Addressing Mode-
In this addressing mode,
The address field of the instruction specifies the address of memory location that contains the
effective address of the operand.
Two references to memory are required to fetch the operand.
Example-
ADD X will increment the value stored in the accumulator by the value stored at memory location
specified by X.
AC ← AC + [[X]]
6. Register Direct Addressing Mode-
In this addressing mode,
The operand is contained in a register set.
The address field of the instruction refers to a CPU register that contains the operand.
No reference to memory is required to fetch the operand.
Example-
ADD R will increment the value stored in the accumulator by the content of register R.
AC ← AC + [R]
NOTE-
It is interesting to note-
This addressing mode is similar to direct addressing mode.
The only difference is that the address field of the instruction refers to a CPU register instead of main
memory.
7. Register Indirect Addressing Mode-
In this addressing mode,
The address field of the instruction refers to a CPU register that contains the effective address of the
operand.
Only one reference to memory is required to fetch the operand.
Example-
ADD R will increment the value stored in the accumulator by the content of memory location
specified in register R.
AC ← AC + [[R]]
NOTE-
It is interesting to note-
This addressing mode is similar to indirect addressing mode.
The only difference is that the address field of the instruction refers to a CPU register.
8. Relative Addressing Mode-
In this addressing mode,
Effective address of the operand is obtained by adding the content of program counter with the
address part of the instruction.
Effective Address
= Content of Program Counter + Address part of the instruction
NOTE-
The program counter (PC) always contains the address of the next instruction to be executed.
After an instruction is fetched, the value of the program counter is immediately incremented.
The value increases irrespective of whether the fetched instruction has completed execution or not.
9. Indexed Addressing Mode-
In this addressing mode,
Effective address of the operand is obtained by adding the content of index register with the address
part of the instruction.
Effective Address
= Content of Index Register + Address part of the instruction
10. Base Register Addressing Mode-
In this addressing mode,
Effective address of the operand is obtained by adding the content of base register with the address
part of the instruction.
Effective Address
= Content of Base Register + Address part of the instruction
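Relative, indexed, and base-register addressing all compute the effective address the same way: some register's content plus the address part of the instruction. The register values and offset below are assumptions for illustration.

```python
def effective_address(register_content, address_part):
    """EA = content of a register + address part of the instruction."""
    return register_content + address_part

pc, index_reg, base_reg = 1000, 20, 3000  # assumed register contents
offset = 50                               # assumed address part

print(effective_address(pc, offset))         # relative:  1000 + 50 = 1050
print(effective_address(index_reg, offset))  # indexed:     20 + 50 = 70
print(effective_address(base_reg, offset))   # base:      3000 + 50 = 3050
```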
11. Auto-Increment Addressing Mode-
This addressing mode is a special case of Register Indirect Addressing Mode where-
Effective Address of the Operand
= Content of Register
In this addressing mode,
After accessing the operand, the content of the register is automatically incremented by step size ‘d’.
Step size ‘d’ depends on the size of operand accessed.
Only one reference to memory is required to fetch the operand.
Example-
Assume operand size = 2 bytes.
Here,
After fetching the operand 6B, register RAUTO will be automatically incremented by 2.
Then, the updated value of RAUTO will be 3300 + 2 = 3302.
At memory address 3302, the next operand will be found.
NOTE-
In auto-increment addressing mode,
First, the operand value is fetched.
Then, the value of register RAUTO is incremented by step size ‘d’.
12. Auto-Decrement Addressing Mode-
This addressing mode is again a special case of Register Indirect Addressing Mode where-
Effective Address of the Operand
= Content of Register – Step Size
In this addressing mode,
First, the content of the register is decremented by step size ‘d’.
Step size ‘d’ depends on the size of operand accessed.
After decrementing, the operand is read.
Only one reference to memory is required to fetch the operand.
Example-
Assume operand size = 2 bytes.
Here,
First, register RAUTO will be decremented by 2.
Then, the updated value of RAUTO will be 3302 – 2 = 3300.
At memory address 3300, the operand will be found.
NOTE-
In auto-decrement addressing mode,
First, the value of register RAUTO is decremented by step size ‘d’.
Then, the operand value is fetched.
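The two auto modes differ only in whether the register is used before or after it is adjusted. This sketch replays the 2-byte examples above (register value 3300/3302).

```python
STEP = 2  # step size 'd' for the assumed 2-byte operands

def auto_increment(r):
    """Use the register as the address first, then add the step (post-increment)."""
    return r, r + STEP          # (operand address, updated register)

def auto_decrement(r):
    """Subtract the step first, then use the register as the address (pre-decrement)."""
    return r - STEP, r - STEP   # (operand address, updated register)

addr, r_auto = auto_increment(3300)
print(addr, r_auto)  # operand fetched from 3300, register becomes 3302
addr, r_auto = auto_decrement(3302)
print(addr, r_auto)  # register first drops to 3300, operand fetched there
```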
Applications of Various Addressing Modes
The applications of various addressing modes are as follows:
Immediate addressing mode: Used to set an initial value for a register. The value is usually a constant
Register addressing mode/direct addressing mode: Used to implement variables and access static
data
Register indirect addressing mode/indirect addressing mode: Used to pass an array as a parameter
and to implement pointers
Relative addressing mode: Used to relocate programs at run time and to change the execution order of
instructions
Index addressing mode: Used to implement arrays
Base register addressing mode: Used to write codes that are relocatable and for handling recursion
Auto-increment/decrement addressing mode: Used to implement loops and stacks
What is replacement in computer architecture?
A virtual memory organization is a combination of hardware and software systems. It can make
efficient utilization of memory space because all the software operations are handled by the memory
management software.
The hardware mapping system and the memory management software together form the structure of
virtual memory.
When program execution starts, one or more pages are transferred into the main memory and
the page table is set to indicate their location. The program executes from the main memory until
a reference is made to a page that is not in memory. This event is defined as a page fault.
When a page fault occurs, the program currently in execution is suspended until the required
page is transferred into the main memory. Because the act of loading a page from auxiliary memory to
main memory is an I/O operation, the operating system assigns this function to the I/O processor.
In this interval, control is transferred to the next program in the main memory that is waiting to be
processed by the CPU. Once the memory block has been assigned and the page transferred, the suspended
program can resume execution.
If the main memory is full, a new page cannot be brought in, so a page must be removed
from a memory block to make room for the new page. The decision of which pages to remove from memory is
determined by the replacement algorithm.
The two common replacement algorithms used are first-in, first-out (FIFO) and least recently
used (LRU).
The FIFO algorithm chooses to replace the page that has been in memory for the longest time. Every
time a page is loaded into memory, its identification number is pushed into a FIFO stack.
FIFO replacement takes effect whenever memory has no more empty blocks. When a new page must be loaded,
the page that was brought in earliest is removed. The page to be removed is easily determined because
its identification number is at the top of the FIFO stack.
The FIFO replacement policy has the benefit of being simple to implement. It has the drawback that under
specific circumstances pages are removed and loaded from memory too frequently.
The LRU policy is more complex to implement but is more attractive, on the presumption that the
least recently used page is a better candidate for removal than the least recently loaded page used in
FIFO. The LRU algorithm can be implemented by associating a counter with each page that is in the main
memory.
When a page is referenced, its associated counter is set to zero. At fixed intervals of time, the
counters associated with all pages currently in memory are incremented by 1.
The least recently used page is the page with the largest count. The counters are known as aging
registers, as their count indicates their age, that is, how long ago their associated pages were last referenced.
Types of cache replacement policies
In this article, we are going to discuss the types of cache replacement policies and illustrate each
policy with a suitable example in computer organization.
In a direct-mapped cache, the position of each block is predetermined, hence no replacement policy
exists.
In fully associative and set-associative caches, replacement policies do exist:
When a new block is brought into the cache and all the positions that it may occupy are full,
the controller needs to decide which of the old blocks to overwrite.
First-in first-out (FIFO) policy
The block which entered the cache first will be replaced first.
This can lead to a problem known as "Belady's Anomaly": if we increase the number of
lines in the cache memory, the number of cache misses may increase.
Belady's Anomaly: for some cache replacement algorithms, the page fault or miss rate increases as the
number of allocated frames increases.
Example: Let us take the sequence 7, 0, 1, 2, 0, 3, 0, 4, 2, 3, with a cache memory of 4 lines.
There are a total of 6 misses in the FIFO replacement policy.
LRU (least recently used)
The page which was not used for the longest period of time in the past will be replaced first.
We can think of this strategy as the optimal cache-replacement algorithm looking backward in
time, rather than forward.
LRU performs much better than FIFO replacement.
LRU is also called a stack algorithm and can never exhibit Belady's Anomaly.
The most important practical problem is how to implement LRU replacement; an LRU page
replacement algorithm may require substantial hardware resources.
Example: Let us take the sequence 7, 0, 1, 2, 0, 3, 0, 4, 2, 3, with a cache memory of 4 lines.
There are a total of 6 misses in the LRU replacement policy.
Optimal Approach
The page which will not be used for the longest period of time in future references will be replaced
first.
The optimal algorithm provides the best performance, but it is difficult to implement as it
requires future knowledge of page references, which is not available.
It is used as a benchmark for cache replacement algorithms.
It is mainly used for comparison studies.
The use of this cache replacement algorithm guarantees the lowest possible page fault rate for a
fixed number of frames.
Example: Let us take the sequence 7, 0, 1, 2, 0, 3, 0, 4, 2, 3, with a cache memory of 4 lines.
There are a total of 6 misses in the optimal replacement policy.
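The miss counts in the examples above can be checked mechanically. This sketch simulates FIFO and LRU on the reference sequence with a 4-line, fully associative cache (the optimal policy is omitted since it needs future knowledge).

```python
from collections import OrderedDict

def count_misses(sequence, lines, policy):
    """Count misses in a small fully associative cache; policy is 'FIFO' or 'LRU'."""
    cache = OrderedDict()
    misses = 0
    for page in sequence:
        if page in cache:
            if policy == "LRU":
                cache.move_to_end(page)    # a hit refreshes recency under LRU
        else:
            misses += 1
            if len(cache) == lines:
                cache.popitem(last=False)  # evict oldest (FIFO) / least recent (LRU)
            cache[page] = None
    return misses

seq = [7, 0, 1, 2, 0, 3, 0, 4, 2, 3]
print(count_misses(seq, 4, "FIFO"))  # 6
print(count_misses(seq, 4, "LRU"))   # 6
```

With 4 lines, both policies happen to give 6 misses on this sequence; they diverge on longer traces.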
Instruction cycle
The instruction cycle is the time required by the CPU to execute one single instruction. The instruction
cycle is the basic operation of the CPU and consists of three steps: the CPU performs the fetch,
decode, execute cycle to carry out one program instruction. The machine cycle is part of the instruction
cycle. The computer system's main function is to execute programs. A computer program consists
of a set of instructions, and the CPU is responsible for executing these program instructions.
The program instructions are stored in the main memory (RAM). The computer memory is organized into
a number of cells, and each cell has a specific memory address. The processor initiates program execution
by fetching the machine instructions one by one from the main memory. The CPU executes these
instructions by repetitively performing the sequence of steps called the instruction cycle. Each part of the
instruction cycle requires a number of machine cycles to complete.
Instruction format
A computer program consists of a number of instructions which direct the CPU to perform specific
operations. However, the CPU needs to know details such as which operation is to be performed, on
which data, and the location of that data. This information is provided by the instruction format. The CPU
starts program execution by fetching the program instructions one by one from the main memory
(RAM). The control unit of the CPU decodes each program instruction.