CS Note
As we know, cache memory bridges the speed mismatch between the main memory and the
processor. Whenever a cache hit occurs,
The required word is present in the cache memory.
The required word is then delivered from the cache memory to the CPU.
And whenever a cache miss occurs,
The required word is not present in the cache memory.
The block containing the required word must be mapped in from the main memory.
Such mapping can be performed using various cache mapping techniques.
Let us discuss the different techniques of cache mapping in this article.
Process of Cache Mapping
The process of cache mapping defines how a certain block that is present in the main memory
gets mapped to a line of the cache memory in the case of a cache miss.
In simpler words, cache mapping refers to the technique by which the contents of the main memory are
brought into the cache memory.
Now, before we proceed, it is crucial that we note these points:
Important Note:
The main memory gets divided into multiple equal-size partitions, known as frames or
blocks.
The cache memory is divided into partitions of the same size as the blocks, known as lines.
During cache mapping, the main memory block is simply copied into the cache; the block is
not removed from the main memory.
Techniques of Cache Mapping
One can perform the process of cache mapping using the following three techniques:
1. K-way Set Associative Mapping
2. Direct Mapping
3. Fully Associative Mapping
1. Direct Mapping
In the case of direct mapping, a particular block of the main memory can map only to a particular
line of the cache. The cache line number to which a given block can map is given by the following:
Cache line number = (Main memory block address) Modulo (Total number of lines in the cache)
For example,
Consider a cache memory that is divided into a total of ‘n’ lines.
Then, block ‘j’ of the main memory can map only to line (j mod n) of the cache.
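The mapping rule above can be sketched in a few lines of Python; the block numbers and the line count n = 8 below are assumptions chosen purely for illustration.

```python
def direct_mapped_line(block_number, num_lines):
    """Direct mapping: a block can live only in line (block mod num_lines)."""
    return block_number % num_lines

# Assumed geometry: a cache with n = 8 lines.
n = 8
for j in (0, 5, 8, 13):
    print(f"main-memory block {j} -> cache line {direct_mapped_line(j, n)}")
```

Note that blocks 5 and 13 collide on line 5, which is exactly why direct mapping needs no replacement policy but can suffer conflict misses.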
To achieve supercomputer speeds, the 360/9x models pioneered new concepts such as instruction
pipelining and lookahead, branch prediction, cache memory, overlap, and parallelism.
But the key thing to remember here is that pipelining as a tool is still ultimately in service
of parallelism. Parallelism is the major feature of the microprocessors developed in the 21st century.
Which two elements are required to implement R-Format arithmetic logic unit (ALU) operations?
ALU and register file.
The ALU actually does the operation, the register file specifies which registers hold the data to be
operated on and the register to write the result to.
R-Format Instruction: Reads two registers, performs an ALU operation on the contents of the registers,
and writes the result to a register. Also called “R-Type Instructions” or “Arithmetic-Logical Instructions”.
Examples include ADD, SUB, AND, and ORR.
Example:
ADD X1, X2, X3
The value in register X2 and the value in register X3 are added by the ALU and the sum is
written into register X1.
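A toy sketch of what the hardware does for this R-format ADD: the register file supplies the two source registers and receives the result, while the ALU performs the addition. The register contents below are made-up values for illustration.

```python
# Minimal register file: names -> contents (the values are assumptions).
registers = {"X1": 0, "X2": 7, "X3": 5}

def r_format_add(rd, rn, rm):
    """ADD rd, rn, rm: read two registers, add on the 'ALU', write rd back."""
    registers[rd] = registers[rn] + registers[rm]

r_format_add("X1", "X2", "X3")  # ADD X1, X2, X3
print(registers["X1"])          # 7 + 5 = 12
```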
Register File — A state element that consists of a set of registers that can be read and written by
supplying a register number to be accessed.
ORR — Logical OR.
Syntax
ORR{S}{cond} Rd, Rn, Operand2
where:
S
is an optional suffix. If S is specified, the condition flags are updated on the result of the operation.
cond
is an optional condition code.
Rd
is the destination register.
Rn
is the register holding the first operand.
Operand2
is a flexible second operand.
Operation
The ORR instruction performs bitwise OR operations on the values in Rn and Operand2.
In certain circumstances, the assembler can substitute ORN for ORR, or ORR for ORN. Be aware of this
when reading disassembly listings.
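Since ORR is just a bitwise OR of Rn with Operand2, its effect is easy to check in Python; the operand values below are arbitrary.

```python
rn = 0b1100         # first operand (Rn), an assumed value
operand2 = 0b1010   # flexible second operand, also assumed
rd = rn | operand2  # ORR Rd, Rn, Operand2: bitwise OR
print(bin(rd))      # 0b1110 — each result bit is set if set in either operand
```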
10. What is an ALU?
Hardware that performs addition, subtraction, and usually logical operations such as AND and OR
What is a Datapath?
The component of the processor that performs arithmetic operations.
What is in a datapath?
[31] DataPath Wikipedia.
A datapath is a collection of functional units such as arithmetic logic units or multipliers that
perform data processing operations, registers, and buses. … A data path is the ALU, the set of
registers, and the CPU’s internal bus(es) that allow data to flow between them.
12. Which component of a computer moderates the action of its other components?
Control — the component of the processor that commands the datapath, memory, and I/O devices
according to the instructions of the program.
The control commands the datapath.
In computer architecture, the control unit is defined as an important component of the central processing
unit (CPU) that controls and directs all the operations of the computer system. The microprocessor
is considered to be the brain of the computer system. [32]
The processor internally consists of three functional units. The CPU functional units include the control unit
(CU), the arithmetic and logic unit (ALU), and the memory unit (MU). [32]
13. Binary Addition
Given the following 8-bit integer binary variables:
X1 = 11000110
X2 = 11110111
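The notes do not show the worked answer for this question, so here is a sketch that simply performs the 8-bit addition of X1 and X2 and reports the truncated result and the carry-out.

```python
x1 = 0b11000110  # 198
x2 = 0b11110111  # 247
total = x1 + x2             # 445, which needs 9 bits
result = total & 0xFF       # low 8 bits kept by an 8-bit register
carry_out = total >> 8      # 1: the addition overflows 8 bits
print(f"sum = {result:08b}, carry out = {carry_out}")
```

So the 8-bit sum is 10111101 with a carry out of 1 (i.e., unsigned overflow).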
18. What is meant by pipelining in computer architecture?
Pipelining is a technique in which multiple instructions are overlapped in execution. It is a strategy
for exploiting instruction-level parallelism within the processor.
18. What is temporal locality?
The principle stating that if a data location is referenced then it will tend to be referenced again soon.
19. What is spatial locality?
The locality principle stating that if a data location is referenced, data locations with nearby addresses
will tend to be referenced soon.
20. What’s the principle of locality?
The principle of locality states that programs access a relatively small portion of their address space at
any instant of time.
21. What’s miss penalty?
The time required to fetch a block into a level of the memory hierarchy from the lower level, including
the time to access the block, transmit it from the one level to the other, insert it in the level that
experienced the miss, and then pass the block to the requestor.
Memory Hierarchy [3]
22. What does each bank of modern DRAMS consist of?
Rows
DRAM — Memory built as an integrated circuit; it provides random access to any location. Access times
are about 50 nanoseconds, and cost per gigabyte in 2012 was $5-$10. It is denser (has more bits per unit area)
than SRAM, but not as fast. It is volatile memory that needs power to retain data, and it clears after every
reboot or loss of power. Main memory, the memory used to hold programs while they are running,
typically consists of DRAM in today's computers.
The value kept in a cell in Dynamic RAM is stored as a charge in a capacitor. A single transistor is then
used to access this stored charge, either to read the value or to overwrite the charge stored there. Because
DRAMs use only one transistor per bit of storage, they are much denser and cheaper per bit than SRAM.
As DRAMs store the charge on a capacitor, it cannot be kept indefinitely and must periodically be
refreshed. That is why this memory structure is called dynamic.
To refresh we just read the cell contents and then write it back. The charge keeps for a few milliseconds,
so that’s how long we have to write it back before we lose the data. It would take too long to do this for
each cell so DRAMs use a two-level decoding structure that lets us refresh an entire Row (a whole word
line) at a time.
The row organization that helps with refresh also helps with performance. To improve performance,
DRAMs buffer rows for repeated access. The buffer acts like an SRAM; by changing the address,
random bits can be accessed in the buffer until the next row access. This capability improves the access
time significantly, since the access time to bits in the row is much lower. Making the chip wider also
improves the memory bandwidth of the chip. When the row is in the buffer, it can be transferred by
successive addresses at whatever the width of the DRAM is (typically 4, 8, or 16 bits), or by specifying a
block transfer and the starting address within the buffer.
23. Is SRAM or DRAM used to implement memory levels closest to the processor?
SRAM, because SRAM is faster and so is used to implement the memory levels closest to the
processor. SRAM access times typically range from 0.5 to 2.5 ns, while DRAM access times
typically range from 50 to 70 ns.
24. Which has fewer transistors per bit of memory, SRAM or DRAM?
DRAM — that is why it is cheaper (transistors cost money), not as close to the CPU, and slower.
The architectural difference between the two is that DRAM uses transistors and capacitors in an array of
repeating circuits (where each circuit is one bit), whereas SRAM uses several transistors in a circuit to
form one bit. [19]
25. Which type of RAM requires a periodic refresh, SRAM or DRAM?
DRAM — hence the word dynamic
26. What is superscalar as it relates to parallelization?
Denoting a computer architecture where several instructions are loaded at once and, as far as possible, are
executed simultaneously, shortening the time taken to run the whole program.
Superscalar describes a microprocessor design that makes it possible for more than one instruction at a
time to be executed during a single clock cycle.
27. Example ARM Instruction Problem:
The value of b is stored in r1, c is stored in r2, and a is stored in r0.
Which set of ARM instructions will accomplish a = b & c?
→ AND r0, r1, r2
OR r0, r1, r2
EOR r0, r1, r2
ORR r0, r1, r2
28. Example ARM Instruction Problem:
X1: A
X2: B
X3: C
39. Which register will be populated with the reason for an exception in LEGv8 architecture?
ESR — Exception Syndrome Register
The ESR register holds a field that indicates the reason for the exception.
40. What is a page table?
A page table is the data structure used by a virtual memory system in a computer operating system to
store the mapping between virtual addresses and physical addresses. Virtual addresses are used by the
program executed by the accessing process, while physical addresses are used by the hardware, or more
specifically, by the RAM subsystem. The page table provides the virtual-address translation that is
necessary to access data in memory.
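A page table lookup can be sketched as a simple mapping from virtual page numbers to physical frame numbers. The page size and the table entries below are assumptions for illustration only.

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages

# Page table for one process: virtual page number -> physical frame number.
page_table = {0: 5, 1: 2, 2: 7}

def translate(virtual_address):
    """Split the address into (page, offset) and rebuild it as (frame, offset)."""
    page, offset = divmod(virtual_address, PAGE_SIZE)
    frame = page_table[page]  # a missing entry here would mean a page fault
    return frame * PAGE_SIZE + offset

print(translate(4100))  # page 1, offset 4 -> frame 2 -> 8196
```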
41. What is a page table register?
A page table base register (PTBR) holds the base address for the page table of the current process. It is a
processor register managed by the operating system: to support virtual memory, the processor must
provide a page table base register that is accessible to the operating system.
[24] Virtual Memory Diagram https://www.usna.edu/Users/cs/crabbe/SI411/current/memory/vm.html
42. Which statement about the operating system describes how virtual memory is allocated in
ARM architecture?
→ It loads the page table register to refer to the page table of the process.
It uses a reference bit to refer to the page table of the process
It loads the entire page table to reference the process
It uses a limit register to refer to the page table of the process.
[25] Paging Systems. https://people.cs.rutgers.edu/~pxk/416/notes/10-paging.html
43. What is a virtual address space?
In computing, a virtual address space (VAS) is the set of ranges of virtual addresses that an operating
system makes available to a process. The range of virtual addresses usually starts at a low address and
can extend to the highest address allowed by the computer's instruction set architecture.
Virtual addresses need to be translated to physical addresses. Virtual addresses can reference anywhere
in memory, not just in “virtual memory”.
44. What is a translation-lookaside buffer?
A translation lookaside buffer (TLB) is a memory cache that is used to reduce the time taken to access a
user memory location. It is a part of the chip's memory management unit (MMU).
The TLB stores recent translations of virtual memory to physical memory and can be called an address-
translation cache. A TLB may reside between the CPU and the CPU cache, between CPU cache and the
main memory, or between the different levels of the multi-level cache. The majority of desktop, laptop
and server processors include one or more TLBs in the memory-management hardware, and it is nearly
always present in any processor that utilizes paged or segmented virtual memory.
44. What is used by virtual memory to increase performance?
→ Translation-lookaside buffer
Sparse Memory
Demand Paging
Page Size
This buffer caches recently translated addresses so that they need not be translated over and over again.
By the principle of temporal locality, we can infer that a recently accessed address is likely to be
accessed again in the near future.
45. What maps virtual memory to real memory by using page tables?
→ Each guest operating system manages virtual memory independently
The host operating system manages each virtual machine's virtual memory
The hypervisor software manages each virtual machine’s virtual memory
Guest operating systems prohibit the use of virtual memory
What is Amdahl’s Law?
In 1967, Amdahl pointed out that the speedup is limited by the fraction of the serial part of the software
that is not amenable to parallelization [1]. Amdahl’s law can be formulated as follows
speedup = 1 / (s + p / N)
where s is the proportion of execution time spent on the serial part, p is the proportion of execution time
spent on the part that can be parallelized, and N is the number of processors.
Amdahl’s law states that, for a fixed problem, the upper limit of speedup is determined by the serial
fraction of the code.
This is called strong scaling and can be explained by the following example.
Consider a program that takes 20 hours to run using a single processor core. If a particular part of the
program, which takes one hour to execute, cannot be parallelized (s = 1/20 = 0.05), and if the code that
takes up the remaining 19 hours of execution time can be parallelized (p = 1 − s = 0.95), then regardless
of how many processors are devoted to a parallelized execution of this program, the minimum execution
time cannot be less than that critical one hour. Hence, the theoretical speedup is limited to at most 20
times (when N = ∞, speedup = 1/s = 20). As such, the parallelization efficiency decreases as the amount
of resources increases. For this reason, parallel computing with many processors is useful only for highly
parallelized programs.
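The worked example above (s = 0.05) can be reproduced directly from the formula speedup = 1 / (s + p / N):

```python
def amdahl_speedup(s, n):
    """Amdahl's law with serial fraction s and parallel fraction p = 1 - s."""
    return 1.0 / (s + (1.0 - s) / n)

s = 0.05  # the one serial hour out of twenty
for n in (1, 4, 16, 1024):
    print(f"N = {n:4d}: speedup = {amdahl_speedup(s, n):.2f}")
print(f"upper limit (N -> infinity): {1 / s:.0f}x")
```

Even with 1024 processors the speedup (about 19.6x) is already pressing against the 20x ceiling set by the serial 5%.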
Weak scaling is not bound by Amdahl's law; it is instead bound by Gustafson's law. [27]
[27] Amdahl's and Gustafson's
Laws. https://www.cse-lab.ethz.ch/wp-content/uploads/2018/11/amdahl_gustafson.pdf
The illustration shows the differences between a NUMA and UMA architecture. The NUMA layout has
local memory assigned to each processor, while in the UMA setup multiple processors share memory
via a bus. (Image via Database Technology Group) [4]
Non-uniform memory access is more expensive to implement and more complicated, BUT it is easier to
scale, which is pretty nice.
53. In which type of multiprocessor is latency to any word in main memory the same regardless of
which processor requests access?
→ Uniform memory access
Locking memory access
Non-uniform memory access
Synchronization memory access
The figure below shows that a memory hierarchy uses smaller and faster memory technologies close to
the processor. Thus, accesses that hit in the highest level of the hierarchy can be processed quickly.
Accesses that miss go to lower levels of the hierarchy, which are larger but slower. If the hit rate is high
enough, the memory hierarchy has an effective access time close to that of the highest (and fastest) level
and a size equal to that of the lowest (and largest) level.
In most systems, the memory is a true hierarchy, meaning that data cannot be present in level i unless
they are also present in level i + 1. [3]
Calculate the fastest processor
Four processors (1, 2, 3, and 4) have clock frequencies of 200 MHz, 300 MHz, 500 MHz, and 700
MHz, respectively.
Suppose:
Processor 1 can execute an instruction with an average of 5 steps.
Processor 2 can execute an instruction with an average of 3 steps.
Processor 3 can execute an instruction with an average of 3 steps.
Processor 4 can execute an instruction with an average of 5 steps.
Which processor should be selected to improve performance for the execution of the same instruction?
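Since each instruction takes (average steps) / (clock frequency), the comparison reduces to a little arithmetic; this sketch ranks the four processors from the question.

```python
# Processor -> (clock in MHz, average steps per instruction), from the question.
processors = {1: (200, 5), 2: (300, 3), 3: (500, 3), 4: (700, 5)}

# Time per instruction in microseconds: steps / MHz.
times = {p: steps / mhz for p, (mhz, steps) in processors.items()}
for p, t in sorted(times.items(), key=lambda kv: kv[1]):
    print(f"Processor {p}: {t * 1000:.2f} ns per instruction")
```

Processor 3 wins at 3 / 500 MHz = 6 ns per instruction, beating processor 4's roughly 7.14 ns despite the lower clock.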
Which set-associative cache will improve overall performance?
Answer: Two-way Cache
Fully associative cache: A cache structure in which a block can be placed in any location in the cache.
Set-associative cache: A cache that has a fixed number of locations (at least two) where each block can
be placed.
The middle range of designs between direct mapped and fully associative is called set associative. In a
set-associative cache, there are a fixed number of locations where each block can be placed. A set-
associative cache with n locations for a block is called an n-way set-associative cache. An n-way set-
associative cache consists of a number of sets, each of which consists of n blocks. Each block in the
memory maps to a unique set in the cache given by the index field, and a block can be placed in any
element of that set. Thus, a set-associative placement combines direct-mapped placement and fully
associative placement: a block is directly mapped into a set, and then all the blocks in the set are searched
for a match.
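The placement rule in the paragraph above (a block maps to exactly one set, then any way within it) can be sketched as follows; the cache geometry is an assumption for illustration.

```python
def set_index(block_number, num_sets):
    """Set-associative placement: the block's set is block mod num_sets."""
    return block_number % num_sets

# Assumed geometry: 8 blocks of capacity organised two-way -> 4 sets.
num_sets = 4
for block in (0, 4, 5, 13):
    s = set_index(block, num_sets)
    print(f"block {block} -> set {s} (either of that set's 2 ways)")
```

Blocks 5 and 13 both land in set 1, but unlike direct mapping they can coexist there, one per way.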
73. Which technique should be implemented to reduce cache miss rate?
Answer: blocking
The easiest way to reduce the miss rate is to increase the cache block size, but only up to a certain point
relative to the cache size.
Note that the miss rate actually goes up if the block size is too large relative to the cache size. Each line
represents a cache of different size. (This figure is independent of associativity, discussed soon.)
Unfortunately, SPEC CPU2000 traces would take too long if block size were included, so these data are
based on SPEC92. [5]
74. What happens on an instruction cache miss?
1. Send the original PC value to the memory.
2. Instruct main memory to perform a read and wait for the memory to complete its access.
3. Write the cache entry, putting the data from memory in the data portion of the entry, writing the
upper bits of the address (from the ALU) into the tag field, and turning the valid bit on.
4. Restart the instruction execution at the first step, which will refetch the instruction, this time finding
it in the cache.
Examples-
ADD 10 will increment the value stored in the accumulator by 10.
MOV R #20 initializes register R to a constant value 20.
4. Direct Addressing Mode-
In this addressing mode,
The address field of the instruction contains the effective address of the operand.
Only one reference to memory is required to fetch the operand.
It is also called absolute addressing mode.
Example-
ADD X will increment the value stored in the accumulator by the value stored at memory location X.
AC ← AC + [X]
5. Indirect Addressing Mode-
In this addressing mode,
The address field of the instruction specifies the address of memory location that contains the
effective address of the operand.
Two references to memory are required to fetch the operand.
Example-
ADD X will increment the value stored in the accumulator by the value stored at memory location
specified by X.
AC ← AC + [[X]]
6. Register Direct Addressing Mode-
In this addressing mode,
The operand is contained in a register set.
The address field of the instruction refers to a CPU register that contains the operand.
No reference to memory is required to fetch the operand.
Example-
ADD R will increment the value stored in the accumulator by the content of register R.
AC ← AC + [R]
NOTE-
It is interesting to note-
This addressing mode is similar to direct addressing mode.
The only difference is that the address field of the instruction refers to a CPU register instead of main
memory.
7. Register Indirect Addressing Mode-
In this addressing mode,
The address field of the instruction refers to a CPU register that contains the effective address of the
operand.
Only one reference to memory is required to fetch the operand.
Example-
ADD R will increment the value stored in the accumulator by the content of memory location
specified in register R.
AC ← AC + [[R]]
NOTE-
It is interesting to note-
This addressing mode is similar to indirect addressing mode.
The only difference is that the address field of the instruction refers to a CPU register.
8. Relative Addressing Mode-
In this addressing mode,
Effective address of the operand is obtained by adding the content of program counter with the
address part of the instruction.
Effective Address
= Content of Program Counter + Address part of the instruction
NOTE-
The program counter (PC) always contains the address of the next instruction to be executed.
After an instruction is fetched, the value of the program counter is immediately incremented.
The value increases irrespective of whether the fetched instruction has completed execution or not.
9. Indexed Addressing Mode-
In this addressing mode,
Effective address of the operand is obtained by adding the content of index register with the address
part of the instruction.
Effective Address
= Content of Index Register + Address part of the instruction
10. Base Register Addressing Mode-
In this addressing mode,
Effective address of the operand is obtained by adding the content of base register with the address
part of the instruction.
Effective Address
= Content of Base Register + Address part of the instruction
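Relative, indexed, and base-register addressing all compute the effective address the same way: some register's content plus the address part of the instruction. The register values and offset below are assumptions for illustration.

```python
def effective_address(register_content, address_part):
    """EA = content of a register + address part of the instruction."""
    return register_content + address_part

pc, index_reg, base_reg = 1000, 20, 3000  # assumed register contents
offset = 50                               # assumed address part

print(effective_address(pc, offset))         # relative:  1000 + 50 = 1050
print(effective_address(index_reg, offset))  # indexed:     20 + 50 = 70
print(effective_address(base_reg, offset))   # base:      3000 + 50 = 3050
```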
11. Auto-Increment Addressing Mode-
This addressing mode is a special case of Register Indirect Addressing Mode where-
Effective Address of the Operand
= Content of Register
In this addressing mode,
After accessing the operand, the content of the register is automatically incremented by step size ‘d’.
Step size ‘d’ depends on the size of operand accessed.
Only one reference to memory is required to fetch the operand.
Example-
Assume operand size = 2 bytes.
Here,
After fetching the operand 6B, register RAUTO will be automatically incremented by 2.
Then, the updated value of RAUTO will be 3300 + 2 = 3302.
At memory address 3302, the next operand will be found.
NOTE-
In auto-increment addressing mode,
First, the operand value is fetched.
Then, the value of register RAUTO is incremented by step size ‘d’.
12. Auto-Decrement Addressing Mode-
This addressing mode is again a special case of Register Indirect Addressing Mode where-
Effective Address of the Operand
= Content of Register – Step Size
In this addressing mode,
First, the content of the register is decremented by step size ‘d’.
Step size ‘d’ depends on the size of operand accessed.
After decrementing, the operand is read.
Only one reference to memory is required to fetch the operand.
Example-
Assume operand size = 2 bytes.
Here,
First, register RAUTO will be decremented by 2.
Then, the updated value of RAUTO will be 3302 – 2 = 3300.
At memory address 3300, the operand will be found.
NOTE-
In auto-decrement addressing mode,
First, the value of register RAUTO is decremented by step size ‘d’.
Then, the operand value is fetched.
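The two auto modes differ only in whether the register is used before or after it is adjusted. This sketch replays the 2-byte examples above (register value 3300/3302).

```python
STEP = 2  # step size 'd' for the assumed 2-byte operands

def auto_increment(r):
    """Use the register as the address first, then add the step (post-increment)."""
    return r, r + STEP          # (operand address, updated register)

def auto_decrement(r):
    """Subtract the step first, then use the register as the address (pre-decrement)."""
    return r - STEP, r - STEP   # (operand address, updated register)

addr, r_auto = auto_increment(3300)
print(addr, r_auto)  # operand fetched from 3300, register becomes 3302
addr, r_auto = auto_decrement(3302)
print(addr, r_auto)  # register first drops to 3300, operand fetched there
```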
Applications of Various Addressing Modes
The applications of various addressing modes are as follows:
Immediate addressing mode: Used to set an initial value for a register. The value is usually a constant
Register addressing mode/direct addressing mode: Used to implement variables and access static
data
Register indirect addressing mode/indirect addressing mode: Used to pass an array as a parameter
and to implement pointers
Relative addressing mode: Used to relocate programs at run time and to change the execution order of
instructions
Index addressing mode: Used to implement arrays
Base register addressing mode: Used to write codes that are relocatable and for handling recursion
Auto-increment/decrement addressing mode: Used to implement loops and stacks
What is replacement in computer architecture?
A virtual memory organization is a combination of hardware and software systems. It can make
efficient utilization of memory space because all the software operations are handled by the memory
management software.
The hardware mapping system and the memory management software together form the structure of
virtual memory.
When program execution starts, one or more pages are transferred into the main memory and
the page table is set to indicate their location. The program executes from the main memory until
a reference is made to a page that is not in memory. This event is defined as a page fault.
When a page fault occurs, the program currently in execution is suspended until the required
page is transferred into the main memory. Because the act of loading a page from auxiliary memory to
main memory is an I/O operation, the operating system assigns this function to the I/O processor.
In this interval, control is transferred to the next program in the main memory that is waiting to be
processed by the CPU. Once the memory block has been assigned and the page transferred, the suspended
program can resume execution.
If the main memory is full, a new page cannot be brought in, so a page must be removed
from a memory block to make room for the new page. The decision of which pages to remove from memory is
determined by the replacement algorithm.
The two common replacement algorithms used are first-in, first-out (FIFO) and least recently
used (LRU).
The FIFO algorithm chooses to replace the page that has been in memory for the longest time. Every
time a page is loaded into memory, its identification number is pushed into a FIFO stack.
FIFO replacement takes effect whenever memory has no more empty blocks. When a new page must be loaded,
the page that was brought in earliest is removed. The page to be removed is easily determined because
its identification number is at the top of the FIFO stack.
The FIFO replacement policy has the benefit of being simple to implement. It has the drawback that under
specific circumstances pages are removed and loaded from memory too frequently.
The LRU policy is more complex to implement but is more attractive, on the presumption that the
least recently used page is a better candidate for removal than the least recently loaded page used in
FIFO. The LRU algorithm can be implemented by associating a counter with each page that is in the main
memory.
When a page is referenced, its associated counter is set to zero. At fixed intervals of time, the
counters associated with all pages currently in memory are incremented by 1.
The least recently used page is the page with the largest count. The counters are known as aging
registers, as their count indicates their age, that is, how long ago their associated pages were last referenced.
Types of cache replacement policies
In this article, we are going to discuss the types of cache replacement policies and illustrate each
policy with a suitable example in computer organization.
In a direct-mapped cache, the position of each block is predetermined, hence no replacement policy
exists.
In fully associative and set-associative caches, replacement policies do exist:
When a new block is brought into the cache and all the positions that it may occupy are full,
the controller needs to decide which of the old blocks to overwrite.
First-in first-out (FIFO) policy
The block which entered the cache first will be replaced first.
This can lead to a problem known as "Belady's Anomaly": if we increase the number of
lines in the cache memory, the number of cache misses may increase.
Belady's Anomaly: for some cache replacement algorithms, the page fault or miss rate increases as the
number of allocated frames increases.
Example: Let us take the sequence 7, 0, 1, 2, 0, 3, 0, 4, 2, 3, with a cache memory of 4 lines.
There are a total of 6 misses in the FIFO replacement policy.
LRU (least recently used)
The page which was not used for the longest period of time in the past will be replaced first.
We can think of this strategy as the optimal cache-replacement algorithm looking backward in
time, rather than forward.
LRU performs much better than FIFO replacement.
LRU is also called a stack algorithm and can never exhibit Belady's Anomaly.
The most important practical problem is how to implement LRU replacement; an LRU page
replacement algorithm may require substantial hardware resources.
Example: Let us take the sequence 7, 0, 1, 2, 0, 3, 0, 4, 2, 3, with a cache memory of 4 lines.
There are a total of 6 misses in the LRU replacement policy.
Optimal Approach
The page which will not be used for the longest period of time in future references will be replaced
first.
The optimal algorithm provides the best performance, but it is difficult to implement as it
requires future knowledge of page references, which is not available.
It is used as a benchmark for cache replacement algorithms.
It is mainly used for comparison studies.
The use of this cache replacement algorithm guarantees the lowest possible page fault rate for a
fixed number of frames.
Example: Let us take the sequence 7, 0, 1, 2, 0, 3, 0, 4, 2, 3, with a cache memory of 4 lines.
There are a total of 6 misses in the optimal replacement policy.
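The miss counts in the examples above can be checked mechanically. This sketch simulates FIFO and LRU on the reference sequence with a 4-line, fully associative cache (the optimal policy is omitted since it needs future knowledge).

```python
from collections import OrderedDict

def count_misses(sequence, lines, policy):
    """Count misses in a small fully associative cache; policy is 'FIFO' or 'LRU'."""
    cache = OrderedDict()
    misses = 0
    for page in sequence:
        if page in cache:
            if policy == "LRU":
                cache.move_to_end(page)    # a hit refreshes recency under LRU
        else:
            misses += 1
            if len(cache) == lines:
                cache.popitem(last=False)  # evict oldest (FIFO) / least recent (LRU)
            cache[page] = None
    return misses

seq = [7, 0, 1, 2, 0, 3, 0, 4, 2, 3]
print(count_misses(seq, 4, "FIFO"))  # 6
print(count_misses(seq, 4, "LRU"))   # 6
```

With 4 lines, both policies happen to give 6 misses on this sequence; they diverge on longer traces.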
Instruction cycle
The instruction cycle is the time required by the CPU to execute one single instruction. The instruction
cycle is the basic operation of the CPU and consists of three steps: the CPU performs the fetch,
decode, execute cycle to carry out one program instruction. The machine cycle is part of the instruction
cycle. The computer system's main function is to execute programs. A computer program consists
of a set of instructions, and the CPU is responsible for executing these program instructions.
The program instructions are stored in the main memory (RAM). The computer memory is organized into
a number of cells, and each cell has a specific memory address. The processor initiates program execution
by fetching the machine instructions one by one from the main memory. The CPU executes these
instructions by repetitively performing the sequence of steps called the instruction cycle. Each part of the
instruction cycle requires a number of machine cycles to complete.
Instruction format
A computer program consists of a number of instructions which direct the CPU to perform specific
operations. However, the CPU needs to know details such as which operation is to be performed, on
which data, and the location of that data. This information is provided by the instruction format. The CPU
starts program execution by fetching the program instructions one by one from the main memory
(RAM). The control unit of the CPU decodes each program instruction.