21ec52 Co Arm m2 Notes
Prepared by
Dr. Nataraju A B
Assistant Professor, Department of ECE
Acharya Institute of Technology, Bangalore
Ideally, the memory would be fast, large, and inexpensive. Unfortunately, it is impossible to
meet all three of these requirements simultaneously. Increased speed and size are achieved at increased
cost. Much work has gone into developing structures that improve the effective speed and size of the
memory, yet keep the cost reasonable.
The maximum size of the Main Memory (MM) that can be used in any computer is determined
by its addressing scheme. For example, a 16-bit computer that generates 16-bit addresses is capable of
addressing up to 2^16 = 64K memory locations. If a machine generates 32-bit addresses, it can access
up to 2^32 = 4G memory locations. This number represents the size of the address space of the computer.
If the smallest addressable unit of information is a memory word, the machine is called word-
addressable. If individual memory bytes are assigned distinct addresses, the computer is called byte-
addressable. Most commercial machines are byte-addressable. For example, in a byte-addressable
32-bit computer, each memory word contains 4 bytes. A possible word-address assignment would be:
Word Address    Byte Addresses
0               0, 1, 2, 3
4               4, 5, 6, 7
8               8, 9, 10, 11
...             ...
With the above structure, a READ or WRITE may involve an entire memory word or only a byte. In the
case of a byte read, the other bytes of the word may also be read but are ignored by the CPU. However,
during a write cycle, the control circuitry of the MM must ensure that only the specified byte is altered.
In this case, the higher-order 30 bits of the address specify the word and the lower-order 2 bits specify the
byte within the word.
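As a small illustration (a sketch in Python, not part of the original notes), the following fragment splits a byte address into the word address and the byte position within the word for the 4-bytes-per-word case described above:

# Sketch: splitting a byte address into the word address (higher-order 30 bits)
# and the byte position within the word (lower-order 2 bits).
def split_byte_address(addr):
    word_address = addr & ~0x3      # clear the lower 2 bits -> address of the word
    byte_in_word = addr & 0x3       # lower 2 bits -> which byte inside the word
    return word_address, byte_in_word

for a in (0, 5, 10, 11):
    word, byte = split_byte_address(a)
    print(f"byte address {a:2d} -> word address {word:2d}, byte {byte} of that word")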
Cache Memory:-
The CPU of a computer can usually process instructions and data faster than they can be fetched from
a comparably priced main memory unit. Thus the memory cycle time becomes the bottleneck in the system.
One way to reduce the memory access time is to use a cache memory. This is a small, fast memory
that is inserted between the larger, slower main memory and the CPU. It holds the currently active
segments of a program and their data. Because of the locality of address references, the CPU can, most of
the time, find the relevant information in the cache memory itself (a cache hit) and only infrequently needs
to access the main memory (a cache miss). With a suitable size of cache memory, cache hit rates of over
90% are possible, leading to a cost-effective increase in the performance of the system.
Memory Interleaving: -
Virtual Memory: -
In a virtual memory System, the address generated by the CPU is referred to as a virtual or logical
address. The corresponding physical address can be different and the required mapping is implemented
by a special memory control unit, often called the memory management unit. The mapping function
itself may be changed during program execution according to system requirements.
A distinction is thus made between the logical (virtual) address space and the physical address
space: while the former can be as large as the addressing capability of the CPU, the actual physical
memory can be much smaller. Only the active portion of the virtual address space is mapped onto the
physical memory; the rest of the virtual address space is mapped onto the bulk storage device used.
If the addressed information is in the Main Memory (MM), it is accessed and execution proceeds.
Otherwise, an exception is generated, in response to which the memory management unit transfers a
contiguous block of words containing the desired word from the bulk storage unit to the MM,
displacing some block that is currently inactive. If the memory is managed in such a way that such
transfers are required relatively infrequently (i.e., the CPU will generally find the required information in
the MM), the virtual memory system can provide reasonably good performance and succeed in
creating the illusion of a large memory built from a small, inexpensive MM.
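The mapping itself can be pictured with a minimal sketch, assuming a simple page-table lookup; the page size and table contents below are purely illustrative:

# Sketch of the mapping performed by a memory management unit: the virtual
# address is split into a page number and an offset, and the page number is
# looked up in a page table. A missing entry stands for the exception that
# triggers a block transfer from bulk storage.
PAGE_SIZE = 4096                     # assumed page size for this sketch
page_table = {0: 7, 1: 3, 5: 12}     # virtual page -> physical frame (hypothetical)

def translate(virtual_address):
    page, offset = divmod(virtual_address, PAGE_SIZE)
    if page not in page_table:
        raise LookupError(f"page fault: virtual page {page} not in main memory")
    return page_table[page] * PAGE_SIZE + offset

print(hex(translate(0x1234)))        # page 1 -> frame 3 -> 0x3234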
The following figure shows such an organization of a memory chip consisting of 16 words of 8 bits each,
which is usually referred to as a 16 x 8 organization.
The data input and the data output of each Sense/Write circuit are connected to a single bi-directional data
line in order to reduce the number of pins required. One control line, the R/W (Read/Write) input, is used to
specify the required operation, and another control line, the CS (Chip Select) input, is used to select a given
chip in a multichip memory system. This circuit requires 14 external connections and, allowing 2 pins for
power supply and ground connections, can be manufactured in the form of a 16-pin chip. It can store 16 x 8
= 128 bits.
Figure 8.2 is an example of a very small memory circuit consisting of 16 words of 8 bits each. This is
referred to as a 16 × 8 organization. The data input and the data output of each Sense/Write circuit are
connected to a single bidirectional data line that can be connected to the data lines of a computer. Two
control lines, R/W and CS, are provided. The R/W (Read/Write) input specifies the required operation,
and the CS (Chip Select) input selects a given chip in a multichip memory system.
The memory circuit in Figure 8.2 stores 128 bits and requires 14 external connections for address, data,
and control lines. It also needs two lines for power supply and ground connections. Consider now a
slightly larger memory circuit, one that has 1K (1024) memory cells. This circuit can be organized as a
128 × 8 memory, requiring a total of 19 external connections. Alternatively, the same number of cells
can be organized into a 1K×1 format. In this case, a 10-bit address is needed, but there is only one data
line, resulting in 15 external connections.
The 10-bit address is divided into two groups of 5 bits each to form the row and column addresses for
the cell array. A row address selects a row of 32 cells, all of which are accessed in parallel. One of
these, selected by the column address, is connected to the external data lines by the input and output
multiplexers. This structure, which can store 1024 bits, can be implemented in a 16-pin chip.
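A short sketch of this address split, assuming (for illustration) that the higher-order 5 bits form the row address:

# Sketch of the row/column decoding described above for the 1K x 1 organization:
# the 10-bit address is split into a 5-bit row address and a 5-bit column address.
def row_column(addr10):
    row = (addr10 >> 5) & 0x1F       # higher-order 5 bits select one of 32 rows
    col = addr10 & 0x1F              # lower-order 5 bits select one of 32 cells in the row
    return row, col

print(row_column(0b10110_01101))     # -> (22, 13)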
Read Operation:-
Let us assume that Q1 on and Q2 off represents a 1. To read the contents of a given cell, the voltage on
the corresponding word line is reduced from 2.5 V to approximately 0.3 V. This causes one of the
diodes D1 or D2 to become forward-biased, depending on whether transistor Q1 or Q2 is
conducting. As a result, current flows from bit line b when the cell is in the 1 state and from bit line b'
when the cell is in the 0 state. The Sense/Write circuit at the end of each pair of bit lines monitors the
current on lines b and b' and sets the output bit line accordingly.
Write Operation: -
While a given row of bits is selected, that is, while the voltage on the corresponding word line is 0.3V,
the cells can be individually forced to either the 1 state by applying a positive voltage of about 3V to
line b’ or to the 0 state by driving line b. This function is performed by the Sense/Write circuit.
Dynamic Memories:-
The basic idea of dynamic memory is that information is stored in the form of a charge on the capacitor.
An example of a dynamic memory cell is shown below: When the transistor T is turned on and an
appropriate voltage is applied to the bit line, information is stored in the cell, in the form of a known
amount of charge stored on the capacitor.
After the transistor is turned off, the capacitor begins to discharge. This is caused by the capacitor’s
own leakage resistance and the very small amount of current that still flows through the transistor.
Hence the data can be read correctly only if it is read before the charge on the capacitor drops below some
threshold value. During a Read operation, the bit line is placed in a high-impedance state, the transistor
is turned on and a sense circuit connected to the bit line is used to determine whether the charge on the
capacitor is above or below the threshold value. During such a Read, the charge on the capacitor is
restored to its original value and thus the cell is refreshed with every read operation.
A typical organization of a 64K x 1 dynamic memory chip is shown below. The cells are organized in the
form of a square array such that the high- and low-order 8 bits of the 16-bit address constitute the row and
column addresses of a cell, respectively. In order to reduce the number of pins needed for external
connections, the row and column addresses are multiplexed on 8 pins. To access a cell, the row address is
applied first. It is loaded into the row address latch in response to a single pulse on the Row Address Strobe
(RAS) input. This selects a row of cells. Next, the column address is applied to the address pins and is
loaded into the column address latch under the control of the Column Address Strobe (CAS) input; this
address selects the appropriate Sense/Write circuit. If the R/W signal indicates a Read operation, the output
of the selected circuit is transferred to the data output, DO. For a Write operation, the data on the DI line is
used to overwrite the cell selected.
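The multiplexing of the row and column addresses can be sketched as follows; this only illustrates the address split, not the chip's timing:

# Sketch of the multiplexed addressing for the 64K x 1 chip: the 16-bit address
# is presented on 8 pins in two steps, the higher-order half first (latched on
# RAS) and the lower-order half next (latched on CAS).
def multiplexed_address(addr16):
    row = (addr16 >> 8) & 0xFF       # higher-order 8 bits, applied first with RAS
    col = addr16 & 0xFF              # lower-order 8 bits, applied next with CAS
    return row, col

print(multiplexed_address(0xA35C))   # -> (163, 92)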
Another feature available on many dynamic memory chips is that once the row address is loaded, successive
locations can be accessed by loading only column addresses. Such block transfers can be carried out
typically at a rate that is double that for transfers involving random addresses. Such a feature is useful when
memory access follows a regular pattern, for example, in a graphics terminal. Because of their high density
and low cost, dynamic memories are widely used in the main memory units of computers. Commercially
available chips range in size from 1K to 4M bits or more, and are available in various organizations such as
64K x 1, 16K x 4, 1M x 1, etc.
A memory is called a read-only memory, or ROM, when information can be written into it only once at the time of
manufacture. Figure 8.11 shows a possible configuration for a ROM cell. A logic value 0 is stored in the cell if the transistor
is connected to ground at point P; otherwise, a 1 is stored. The bit line is connected through a resistor to the power supply. To
read the state of the cell, the word line is activated to close the transistor switch. As a result, the voltage on the bit line drops
to near zero if there is a connection between the transistor and ground. If there is no connection to ground, the bit line
remains at the high voltage level, indicating a 1. A sense circuit at the end of the bit line generates the proper output value.
The state of the connection to ground in each cell is determined when the chip is manufactured, using a mask with a pattern
that represents the information to be stored.
PROM
Some ROM designs allow the data to be loaded by the user, thus providing a programmable ROM
(PROM). Programmability is achieved by inserting a fuse at point P in Figure 8.11. Before it is
programmed, the memory contains all 0s. The user can insert 1s at the required locations by burning out
the fuses at these locations using high-current pulses. Of course, this process is irreversible.
PROMs provide flexibility and convenience not available with ROMs. The cost of preparing the masks
needed for storing a particular information pattern makes ROMs cost effective only in large volumes.
The alternative technology of PROMs provides a more convenient and considerably less expensive
approach, because memory chips can be programmed directly by the user.
EPROM
Another type of ROM chip provides an even higher level of convenience. It allows the stored data to be
erased and new data to be written into it. Such an erasable, reprogrammable ROM is usually called an
EPROM. It provides considerable flexibility during the development phase of digital systems. Since
EPROMs are capable of retaining stored information for a long time, they can be used in place of ROMs
or PROMs while software is being developed. In this way, memory changes and updates can be easily
made.
An EPROM cell has a structure similar to the ROM cell in Figure 8.11. However, the connection to
ground at point P is made through a special transistor. The transistor is normally turned off, creating an
open switch; it can be turned on by injecting charge into it that becomes trapped inside, and the entire
chip is erased by exposing it to ultraviolet light, which dissipates this charge.
EEPROM
An EPROM must be physically removed from the circuit for reprogramming. Also, the stored
information cannot be erased selectively. The entire contents of the chip are erased when exposed to
ultraviolet light. Another type of erasable PROM can be programmed, erased, and reprogrammed
electrically. Such a chip is called an electrically erasable PROM, or EEPROM. It does not have to be
removed for erasure. Moreover, it is possible to erase the cell contents selectively. One disadvantage of
EEPROMs is that different voltages are needed for erasing, writing, and reading the stored data, which
increases circuit complexity. However, this disadvantage is outweighed by the many advantages of
EEPROMs. They have replaced EPROMs in practice.
Although dynamic memory units with gigabyte capacities can be implemented at a reasonable cost, the
affordable size is still small compared to the demands of large programs with voluminous data. A
solution is provided by using secondary storage, mainly magnetic disks, to provide the required memory
space. Disks are available at a reasonable cost, and they are used extensively in computer systems.
However, they are much slower than semiconductor memory units. In summary, a very large amount of
cost-effective storage can be provided by magnetic disks, and a large and considerably faster, yet
affordable, main memory can be built with dynamic RAM technology. This leaves the more expensive
and much faster static RAM technology to be used in smaller units where speed is of the essence, such
as in cache memories.
All of these different types of memory units are employed effectively in a computer system. The entire
computer memory can be viewed as the hierarchy depicted in Figure 8.14. The fastest access is to data
held in processor registers. Therefore, if we consider the registers to be part of the memory hierarchy,
then the processor registers are at the top in terms of speed of access. Of course, the registers provide
only a minuscule portion of the required memory.
At the next level of the hierarchy is a relatively small amount of memory that can be implemented
directly on the processor chip. This memory, called a processor cache, holds copies of the instructions
and data stored in a much larger memory that is provided externally. There are often two or more levels
of cache. A primary cache is always located on the processor chip. This cache is small and its access
time is comparable to that of processor registers. The primary cache is referred to as the level 1 (L1)
cache. A larger, and hence somewhat slower, secondary cache is placed between the primary cache and
the rest of the memory. It is referred to as the level 2 (L2) cache. Often, the L2 cache is also housed on
the processor chip.
Some computers have a level 3 (L3) cache of even larger size, in addition to the L1 and L2 caches. An
L3 cache, also implemented in SRAM technology, may or may not be on the same chip with the
processor and the L1 and L2 caches.
The next level in the hierarchy is the main memory. This is a large memory implemented using dynamic
memory components, typically assembled in memory modules such as DIMMs, as described in Section
8.2.5. The main memory is much larger but significantly slower than cache memories. In a computer
with a processor clock of 2 GHz or higher, the access time for the main memory can be as much as 100
times longer than the access time for the L1 cache.
Disk devices provide a very large amount of inexpensive memory, and they are widely used as
secondary storage in computer systems. They are very slow compared to the main memory. They
represent the bottom level in the memory hierarchy.
During program execution, the speed of memory access is of utmost importance. The key to managing
the operation of the hierarchical memory system in Figure 8.14 is to bring the instructions and data that
are about to be used as close to the processor as possible. This is the main purpose of using cache
memories, which we discuss next.
Conceptually, operation of a cache memory is very simple. The memory control circuitry is designed to take
advantage of the property of locality of reference. Temporal locality suggests that whenever an information item, instruction
or data, is first needed, this item should be brought into the cache, because it is likely to be needed again soon. Spatial
locality suggests that instead of fetching just one item from the main memory to the cache, it is useful to fetch several items
that are located at adjacent addresses as well. The term cache block refers to a set of contiguous address locations of some
size. Another term that is often used to refer to a cache block is a cache line.
Consider the arrangement in Figure 8.15. When the processor issues a Read request, the contents of a block of
memory words containing the location specified are transferred into the cache. Subsequently, when the program references
any of the locations in this block, the desired contents are read directly from the cache. Usually, the cache memory can store a
reasonable number of blocks at any given time, but this number is small compared to the total number of blocks in the main
memory. The correspondence between the main memory blocks and those in the cache is specified by a mapping function.
When the cache is full and a memory word (instruction or data) that is not in the cache is referenced, the cache control
hardware must decide which block should be removed to create space for the new block that contains the referenced word.
The collection of rules for making this decision constitutes the cache’s replacement algorithm.
Cache Hits
The processor does not need to know explicitly about the existence of the cache. It simply issues Read and Write
requests using addresses that refer to locations in the memory. The cache control circuitry determines whether the requested
word currently exists in the cache. If it does, the Read or Write operation is performed on the appropriate cache location. In
this case, a read or write hit is said to have occurred. The main memory is not involved when there is a cache hit in a Read
operation. For a Write operation, the system can proceed in one of two ways. In the first technique, called the write-through
protocol, both the cache location and the main memory location are updated. The second technique is to update only the
cache location and to mark the block containing it with an associated flag bit, often called the dirty or modified bit. The main
memory location of the word is updated later, when the block containing this marked word is removed from the cache to
make room for a new block. This technique is known as the write-back, or copy-back, protocol.
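The difference between the two protocols can be seen with a toy one-block cache; the structure below is purely illustrative:

# Minimal sketch contrasting the two write policies described above; the cache
# here is a single block with a dirty bit, purely for illustration.
main_memory = {100: 7}
cache = {"addr": 100, "data": 7, "dirty": False}

def write_through(addr, value):
    cache["data"] = value            # update the cache location ...
    main_memory[addr] = value        # ... and the main memory location together

def write_back(addr, value):
    cache["data"] = value            # update only the cache location
    cache["dirty"] = True            # mark the block as modified

def evict():                         # on replacement, a dirty block is copied back
    if cache["dirty"]:
        main_memory[cache["addr"]] = cache["data"]
        cache["dirty"] = False

write_back(100, 42)
print(main_memory[100])              # still 7: memory is updated only on eviction
evict()
print(main_memory[100])              # now 42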
Cache Misses
A Read operation for a word that is not in the cache constitutes a Read miss. It causes the block of words containing
the requested word to be copied from the main memory into the cache. After the entire block is loaded into the cache, the
particular word requested is forwarded to the processor. Alternatively, this word may be sent to the processor as soon as it is
read from the main memory. The latter approach, which is called load-through, or early restart, reduces the processor’s
waiting time somewhat, at the expense of more complex circuitry.
When a Write miss occurs in a computer that uses the write-through protocol, the information is written directly into
the main memory. For the write-back protocol, the block containing the addressed word is first brought into the cache, and
then the desired word in the cache is overwritten with the new information.
Recall from Section 6.7 that resource limitations in a pipelined processor can cause instruction execution to stall for one or
more cycles. This can occur if a Load or Store instruction requests access to data in the memory at the same time that a
subsequent instruction is being fetched. When this happens, instruction fetch is delayed until the data access operation is
completed. To avoid stalling the pipeline, many processors use separate caches for instructions and data, making it possible
for the two operations to proceed in parallel.
There are several possible methods for determining where memory blocks are placed in the cache. It is instructive to
describe these methods using a specific small example. Consider a cache consisting of 128 blocks of 16 words each, for a
total of 2048 (2K) words, and assume that the main memory is addressable by a 16-bit address. The main memory has 64K
words, which we will view as 4K blocks of 16 words each. For simplicity, we have assumed that consecutive addresses refer
to consecutive words.
Direct Mapping
The simplest way to determine cache locations in which to store memory blocks is the direct-mapping technique. In
this technique, block j of the main memory maps onto block j modulo 128 of the cache, as depicted in Figure 8.16. Thus,
whenever one of the main memory blocks 0, 128, 256, . . . is loaded into the cache, it is stored in cache block 0. Blocks 1,
129, 257, . . . are stored in cache block 1, and so on. Since more than one memory block is mapped onto a given cache block
position, contention may arise for that position even when the cache is not full. For example, instructions of a program may
start in block 1 and continue in block 129, possibly after a branch. As this program is executed, both of these blocks must be
transferred to the block-1 position in the cache. Contention is resolved by allowing the new block to overwrite the currently
resident block.
With direct mapping, the replacement algorithm is trivial. Placement of a block in the cache is determined by its
memory address. The memory address can be divided into three fields, as shown in Figure 8.16. The low-order 4 bits select
one of 16 words in a block. When a new block enters the cache, the 7-bit cache block field determines the cache position in
which this block must be stored. The high-order 5 bits of the memory address of the block are stored in 5 tag bits associated
with its location in the cache. The tag bits identify which of the 32 main memory blocks mapped into this cache position is
currently resident in the cache. As execution proceeds, the 7-bit cache block field of each address generated by the processor
points to a particular block location in the cache. The high-order 5 bits of the address are compared with the tag bits
associated with that cache location. If they match, then the desired word is in that block of the cache. If there is no match,
then the block containing the required word must first be read from the main memory and loaded into the cache. The direct-
mapping technique is easy to implement, but it is not very flexible.
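For the example above (16-bit address, 128 cache blocks of 16 words), the three address fields can be extracted as in the following sketch:

# Sketch of the direct-mapping address fields: 5-bit tag, 7-bit block, 4-bit word.
def direct_mapped_fields(addr16):
    word  = addr16 & 0xF             # low-order 4 bits: word within the block
    block = (addr16 >> 4) & 0x7F     # next 7 bits: cache block position (j mod 128)
    tag   = (addr16 >> 11) & 0x1F    # high-order 5 bits: tag stored with the block
    return tag, block, word

# memory block 129 starts at word address 129 * 16 = 2064
print(direct_mapped_fields(2064))    # -> (1, 1, 0): maps to cache block 1 with tag 1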
Associative Mapping
Figure 8.17 shows the most flexible mapping method, in which a main memory block can be placed into any cache
block position. In this case, 12 tag bits are required to identify a memory block when it is resident in the cache. The tag bits
of an address received from the processor are compared to the tag bits of each block of the cache to see if the desired block is
present. This is called the associative-mapping technique. It gives complete freedom in choosing the cache location in which
to place the memory block, resulting in a more efficient use of the space in the cache. When a new block is brought into the
cache, it replaces (ejects) an existing block only if the cache is full. In this case, we need an algorithm to select the block to
be replaced. Many replacement algorithms are possible, as we discuss in Section 8.6.2. The complexity of an associative
cache is higher than that of a direct-mapped cache, because of the need to search all 128 tag patterns to determine whether a
given block is in the cache. To avoid a long delay, the tags must be searched in parallel. A search of this kind is called an
associative search.
Set-Associative Mapping
Another approach is to use a combination of the direct- and associative-mapping techniques. The
blocks of the cache are grouped into sets, and the mapping allows a block of the main memory to reside
in any block of a specific set. Hence, the contention problem of the direct method is eased by having a
few choices for block placement. At the same time, the hardware cost is reduced by decreasing the size
of the associative search. An example of this set-associative-mapping technique is shown in Figure 8.18
for a cache with two blocks per set. In this case, memory blocks 0, 64, 128, . . . , 4032 map into cache
set 0, and they can occupy either of the two block positions within this set. Having 64 sets means that
the 6-bit set field of the address determines which set of the cache might contain the desired block. The
tag field of the address must then be associatively compared to the tags of the two blocks of the set to
check if the desired block is present. This two-way associative search is simple to implement.
The number of blocks per set is a parameter that can be selected to suit the requirements of a
particular computer. For the main memory and cache sizes in Figure 8.18, four blocks per set can be
accommodated by a 5-bit set field, eight blocks per set by a 4-bit set field, and so on. The extreme
condition of 128 blocks per set requires no set bits and corresponds to the fully-associative technique,
with 12 tag bits. The other extreme of one block per set is the direct-mapping method. A cache that has k
blocks per set is referred to as a k-way set-associative cache.
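For the two-way set-associative example of Figure 8.18 (64 sets), the address fields can be extracted as in this sketch:

# Sketch of the set-associative address fields: 6-bit tag, 6-bit set, 4-bit word.
def set_associative_fields(addr16):
    word = addr16 & 0xF              # low-order 4 bits: word within the block
    s    = (addr16 >> 4) & 0x3F      # next 6 bits: set number (block number mod 64)
    tag  = (addr16 >> 10) & 0x3F     # high-order 6 bits: tag compared within the set
    return tag, s, word

# memory block 64 starts at word address 64 * 16 = 1024
print(set_associative_fields(1024))  # -> (1, 0, 0): set 0, tag 1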
Performance Considerations.
Two key factors in the commercial success of a computer are performance and cost; the best
possible performance for a given cost is the objective. A common measure of success is the
price/performance ratio. Performance depends on how fast machine instructions can be brought into the
processor and how fast they can be executed. Chapter 6 shows how pipelining increases the speed of
program execution. In this chapter, we focus on the memory subsystem.
The memory hierarchy described in Section 8.5 results from the quest for the best
price/performance ratio. The main purpose of this hierarchy is to create a memory that the processor
sees as having a short access time and a large capacity. When a cache is used, the processor is able to
access instructions and data more quickly when the data from the referenced memory locations are in the
cache. Therefore, the extent to which caches improve performance is dependent on how frequently the
requested instructions and data are found in the cache. In this section, we examine this issue
quantitatively.
Performance is adversely affected by the actions that need to be taken when a miss occurs. A
performance penalty is incurred because of the extra time needed to bring a block of data from a slower
unit in the memory hierarchy to a faster unit. During that period, the processor is stalled waiting for
instructions or data. The waiting time depends on the details of the operation of the cache. For example,
it depends on whether or not the load-through approach is used. We refer to the total access time seen by
the processor when a miss occurs as the miss penalty.
Consider a system with only one level of cache. In this case, the miss penalty consists almost
entirely of the time to access a block of data in the main memory. Let h be the hit rate, M the miss
penalty, and C the time to access information in the cache. Thus, the average access time experienced by
the processor is
tavg = hC + (1 − h)M
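A quick worked example of this formula, using assumed figures (h = 0.95, C = 1 cycle, M = 100 cycles):

# Worked example of the average access time formula above, with illustrative numbers.
h, C, M = 0.95, 1, 100
t_avg = h * C + (1 - h) * M
print(t_avg)                         # 5.95 cycles on average per access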
When an instruction is fetched, it is placed in the instruction register, IR, from where it is interpreted, or
decoded, by the processor’s control circuitry. The IR holds the instruction until its execution is
completed.
Consider a 32-bit computer in which each instruction is contained in one word in the memory, as in
RISC-style instruction set architecture. To execute an instruction, the processor has to perform the
following steps:
1. Fetch the contents of the memory location pointed to by the PC. The contents of this location are
the instruction to be executed; hence they are loaded into the IR. In register transfer notation, the
required action is
IR←[[PC]]
2. Increment the PC to point to the next instruction. Assuming that the memory is byte addressable,
the PC is incremented by 4; that is
PC←[PC] + 4
3. Carry out the operation specified by the instruction in the IR.
Fetching an instruction and loading it into the IR is usually referred to as the instruction fetch phase.
Performing the operation specified in the instruction constitutes the instruction execution phase.
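The fetch phase can be mimicked in a few lines of Python; the instruction words below are made-up values used only to show the two register transfers:

# Sketch of the fetch phase in register transfer terms: IR <- [[PC]], then
# PC <- [PC] + 4 for a byte-addressable memory holding one 32-bit instruction per word.
memory = {0: 0x1400_0005, 4: 0x2C41_0003}   # hypothetical instruction words
PC = 0

IR = memory[PC]      # IR <- [[PC]]  : fetch the instruction the PC points to
PC = PC + 4          # PC <- [PC] + 4: point to the next instruction
print(hex(IR), PC)   # 0x14000005 4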
With few exceptions, the operation specified by an instruction can be carried out by performing one or
more of the following actions:
• Read the contents of a given memory location and load them into a processor register.
• Read data from one or more processor registers.
• Perform an arithmetic or logic operation and place the result into a processor register.
• Store data from a processor register into a given memory location.
Suppose that we wish to transfer the contents of register R1 to register R4. This can be accomplished as
follows:
1. Enable the output of register R1 by setting R1out to 1. This places the contents of R1 on the processor bus.
2. Enable the input of register R4 by setting R4in to 1. This loads the data from the processor bus into register R4.
All operations and data transfers within the processor take place within time periods defined by the
processor clock. The control signals that govern a particular transfer are asserted at the start of the clock
cycle. In our example, R1out and R4in are set to 1. The registers consist of edge-triggered flip-flops.
Hence, at the next active edge of the clock, the flip-flops that constitute R4 will load the data present at
their inputs. At the same time, the control signals R1out and R4in will return to 0. We will use this
simple model of the timing of data transfers for the rest of this chapter. However, we should point out
that other schemes are possible. For example, data transfers may use both the rising and falling edges of
the clock. Also, when edge-triggered flip-flops are not used, two or more clock signals may be needed to
guarantee proper transfer of data. This is known as multiphase clocking.
An implementation for one bit of register Ri is shown in Figure 7.3 as an example. A two-input
multiplexer is used to select the data applied to the input of an edge-triggered D flip-flop. When the
control input Riin is equal to 1, the multiplexer selects the data on the bus. This data will be loaded into
the flip-flop at the rising edge of the clock. When Riin is equal to 0, the multiplexer feeds back the value
currently stored in the flip-flop.
The Q output of the flip-flop is connected to the bus via a tri-state gate. When Riout is equal to 0, the
gate's output is in the high-impedance (electrically disconnected) state. This corresponds to the open-
circuit state of a switch. When Riout = 1, the gate drives the bus to 0 or 1, depending on the value of Q.
As an example, to perform the action R3 ← [R1] + [R2], the following control sequence can be used:
1. R1out, Yin
2. R2out, SelectY, Add, Zin
3. Zout, R3in
The signals whose names are given in any step are activated for the duration of the clock cycle
corresponding to that step. All other signals are inactive. Hence, in step 1, the output of register R1 and
the input of register Y are enabled, causing the contents of R1 to be transferred over the bus to Y. In step
2, the multiplexer's Select signal is set to SelectY, causing the multiplexer to gate the contents of register
Y to input A of the ALU. At the same time, the contents of register R2 are gated onto the bus and,
hence, to input B. The function performed by the ALU depends on the signals applied to its control
lines. In this case, the Add line is set to 1, causing the output of the ALU to be the sum of the two
numbers at inputs A and B. This sum is loaded into register Z because its input control signal is
activated. In step 3, the contents of register Z are transferred to the destination register, R3. This last
transfer cannot be carried out during step 2, because only one register output can be connected to the bus
during any clock cycle.
In this introductory discussion, we assume that there is a dedicated signal for each function to be
performed. For example, we assume that there are separate control signals to specify individual ALU
operations, such as Add, Subtract, XOR, and so on. In reality, some degree of encoding is likely to be
used. For example, if the ALU can perform eight different operations, three control signals would
suffice to specify the required operation.
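As a sketch of such encoding, three bits suffice to select one of eight ALU operations; the operation list and codes below are illustrative, not taken from the notes:

# Sketch of encoding control signals: eight ALU operations need only 3 bits.
ALU_OPS = ["Add", "Subtract", "AND", "OR", "XOR", "ShiftL", "ShiftR", "Pass"]

def alu_code(op):
    code = ALU_OPS.index(op)         # 0..7 fits in 3 bits
    return format(code, "03b")

print(alu_code("Add"), alu_code("XOR"))   # 000 100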
The connections for register MDR are illustrated in Figure 7.4. It has four control signals: MDRin and
MDRout control the connection to the internal bus, and MDRinE and MDRoutE control the connection to
the external bus. The circuit in Figure 7.3 is easily modified to provide the additional connections. A
three-input multiplexer can be used, with the memory bus data line connected to the third input. This
input is selected when MDRinE = 1. A second tri-state gate, controlled by MDRoutE, can be used to
connect the output of the flip-flop to the memory bus.
During memory Read and Write operations, the timing of internal processor operations must be
coordinated with the response of the addressed device on the memory bus. The processor completes one
internal data transfer in one clock cycle. The speed of operation of the addressed device, on the other
hand, varies with the device. For example, registers in I/O device interfaces are not cached, so their
accesses always take a number of clock cycles.
To accommodate the variability in response time, the processor waits until it receives an indication
that the requested Read operation has been completed. We will assume that a control signal called
Memory-Function-Completed (MFC) is used for this purpose. The addressed device sets this signal to
1 to indicate that the contents of the specified location have been read and are available on the data
lines of the memory bus. (We encountered several examples of such a signal in conjunction with the
buses discussed in Chapter 4, such as Slave-ready in Figure 4.25 and TRDY# in Figure 4.41.) As an
example of a read operation, consider the instruction Move (R1),R2. The actions needed to execute
this instruction are:
1. MAR← [R1]
2. Start a Read operation on the memory bus
3. Wait for the MFC response from the memory
4. Load MDR from the memory bus
5. R2 ← [MDR]
These actions may be carried out as separate steps, but some can be combined into a single step. Each
action can be completed in one clock cycle, except action 3 which requires one or more clock cycles,
depending on the speed of the addressed device.
For simplicity, let us assume that the output of MAR is enabled all the time. Thus, the contents of
MAR are always available on the address lines of the memory bus. This is the case when the processor
is the bus master. When a new address is loaded into MAR, it will appear on the memory bus at the
beginning of the next clock cycle, as shown in Figure 7.5. A Read control signal is activated at the
same time MAR is loaded. This signal will cause the bus interface circuit to send a read command,
MR, on the bus. With this arrangement, we have combined actions 1 and 2 above into a single control
step. Actions 3 and 4 can also be combined by activating the control signal MDRinE while waiting for a
response from the memory. Thus the data received from the memory are loaded into MDR at the end of
the clock cycle in which the MFC signal arrives. The control sequence for a memory Write operation is
similar; for example, writing the contents of R2 into the memory location pointed to by R1 involves the steps:
1. R1out, MARin
2. R2out, MDRin, Write
3. MDRoutE, WMFC
Consider the instruction Add (R3),R1, which adds the contents of the memory location pointed to by R3
to register R1. Executing this instruction requires the following actions:
1. Fetch the instruction.
2. Fetch the first operand (the contents of the memory location pointed to by R3).
3. Perform the addition.
4. Load the result into R1.
Figure 7.6 gives the sequence of control steps required to perform these operations for the single-bus
architecture of Figure 7.1. Instruction execution proceeds as follows. In step 1, the instruction fetch
operation is initiated by loading the contents of the PC into the MAR and sending a Read request to the
memory. The Select signal is set to Select4, which causes the multiplexer MUX to select the constant 4.
This value is added to the operand at input B, which is the contents of the PC, and the result is stored in
register Z. The updated value is moved from register Z back into the PC during step 2, while the processor
waits for the memory to respond. In step 3, the word fetched from the memory is loaded into the IR.
Steps 1 through 3 constitute the instruction fetch phase, which is the same for all instructions. The
instruction decoding circuit interprets the contents of the IR at the beginning of step 4. This enables the
control circuitry to activate the control signals for steps 4 through 7, which constitute the execution
phase. The contents of register R3 are transferred to the MAR in step 4, and a memory read operation is
initiated. Then the contents of R1 are transferred to register Y in step 5, to prepare for the addition
operation. When the Read operation is completed, the memory operand is available in register MDR,
and the addition operation is performed in step 6. The contents of MDR are gated to the bus, and thus
also to the B input of the ALU, and register Y is selected as the second input to the ALU by choosing
SelectY. The sum is stored in register Z, then transferred to R1 in step 7. The End signal causes a new
instruction fetch cycle to begin by returning to step 1.
This discussion accounts for all control signals in Figure 7.6 except Yin in step 2. There is no need to
copy the updated contents of the PC into register Y when executing the Add instruction. But in Branch
instructions the updated value of the PC is needed to compute the Branch target address. To speed up the
execution of Branch instructions, this value is copied into register Y in step 2. Since step 2 is part of the
instruction fetch phase, this action is performed for all instructions, whether or not the value is
subsequently used.
Step    Action
1       PCout, MARin, Read, Select4, Add, Zin
2       Zout, PCin, Yin, WMFC
3       MDRout, IRin
4       R3out, MARin, Read
5       R1out, Yin, WMFC
6       MDRout, SelectY, Add, Zin
7       Zout, R1in, End
Figure 7.6 Control sequence for execution of the instruction Add (R3),R1.
Figure 7.8 depicts a three-bus structure used to connect the registers and the ALU of a processor. All
general-purpose registers are combined into a single block called the register file. In VLSI technology,
the most efficient way to implement a number of registers is in the form of an array of memory cells
similar to those used in the implementation of random-access memories (RAMs) described in Chapter 5.
The register file in Figure 7.8 is said to have three ports. There are two outputs, allowing the contents of
two different registers to be accessed simultaneously and have their contents placed on buses A and B.
The third port allows the data on bus C to be loaded into a third register during the same clock cycle.
Buses A and B are used to transfer the source operands to the A and B inputs of the ALU, where an
arithmetic or logic operation may be performed. The result is transferred to the destination over bus C. If
needed, the ALU may simply pass one of its two input operands unmodified to bus C. We will call the
ALU control signals for such an operation R-A or R-B. The three-bus arrangement obviates the need for
registers Y and Z in Figure 7.1.
A second feature in Figure 7.8 is the introduction of the Incrementor unit, which is used to increment the
PC by 4. Using the Incrementor eliminates the need to add 4 to the PC using the main ALU, as was done
in Figures 7.6 and 7.7. The source for the constant 4 at the ALU input multiplexer is still useful. It can
be used to increment other addresses, such as the memory addresses in Load Multiple (LDM) and Store
Multiple (STM) instructions.
Consider the three-operand instruction: Add R4,R5,R6
The control sequence for executing this instruction is given in Figure 7.9. In step 1, the contents of the
PC are passed through the ALU, using the R-B control signal, and loaded into the MAR to start a
memory read operation. At the same time the PC is incremented by 4. Note that the value loaded into
MAR is the original contents of the PC. The incremented value is loaded into the PC at the end of the
clock cycle and will not affect the contents of MAR. In step 2, the processor waits for MFC and loads
the data received into MDR, then transfers them to IR in step 3. Finally, the execution phase of the
instruction requires only one control step to complete, step 4.
By providing more paths for data transfer a significant reduction in the number of clock cycles needed to
execute an instruction is achieved.
Hardwired Control:-
To execute instructions, the processor must have some means of generating the control signals needed
in the proper sequence. Computer designers use a wide variety of techniques to solve this problem. The
approaches used fall into one of two categories: hardwired control and microprogrammed control. We
discuss each of these techniques in detail, starting with hardwired control in this section.
Consider the sequence of control signals given in Figure 7.6. Each step in this sequence is completed in
one clock period. A counter may be used to keep track of the control steps, as shown in Figure 7.10.
Each state, or count, of this counter corresponds to one control step. The required control signals are
determined by the following information:
1. Contents of the control step counter
2. Contents of the instruction register
3. Contents of the condition code flags
4. External input signals, such as MFC and interrupt requests
To gain insight into the structure of the control unit, we start with a simplified view of the hardware
involved. The decoder/encoder block in Figure 7.10 is a combinational circuit that generates the required
control outputs, depending on the state of all its inputs. By separating the decoding and encoding
functions, we obtain the more detailed block diagram in Figure 7.11. The step decoder provides a
separate signal line for each step, or time slot, in the control sequence. Similarly, the output of the
instruction decoder consists of a separate line for each machine instruction. For any instruction loaded in
the IR, one of the output lines INS1 through INSm is set to 1, and all other lines are set to 0. (For design
details of decoders, refer to Appendix A.) The input signals to the encoder block in Figure 7.11 are
combined to generate the individual control signals Yin, PCout, Add, End, and so on. An example of
how the encoder generates the Zin control signal for the processor organization in Figure 7.1 is given in
Figure 7.12. This circuit implements the logic function
Zin = T1 + T6 · ADD + T4 · BR + ...
This signal is asserted during time slot T1 for all instructions, during T6 for an Add instruction, during
T4 for an unconditional Branch instruction, and so on. The logic function for Zin is derived from the
control sequences in Figures 7.6 and 7.7.
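The same function can be written directly as a Boolean expression; only the three terms mentioned above are included in this sketch:

# Sketch of the Zin encoder function: Zin = T1 + T6.ADD + T4.BR + ...
def zin(T, add_instr, br_instr):
    # T is the active time-slot number from the step decoder;
    # add_instr / br_instr are the instruction-decoder outputs.
    return (T == 1) or (T == 6 and add_instr) or (T == 4 and br_instr)

print(zin(1, False, False), zin(6, True, False), zin(6, False, True))  # True True False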
First, we introduce some common terms. A control word (CW) is a word whose individual bits represent
the various control signals in Figure 7.11. Each of the control steps in the control sequence of an
instruction defines a unique combination of 1s and 0s in the CW. The CWs corresponding to the 7 steps
of Figure 7.6 are shown in Figure 7.15. We have assumed that SelectY is represented by Select = 0 and
Select4 by Select = 1. A sequence of CWs corresponding to the control sequence of a machine
instruction constitutes the microroutine for that instruction, and the individual control words in this
microroutine are referred to as microinstructions.
The microroutines for all instructions in the instruction set of a computer are stored in a special memory
called the control store. The control unit can generate the control signals for any instruction by
sequentially reading the CWs of the corresponding microroutine from the control store. This suggests
organizing the control unit as shown in Figure 7.16. To read the control words sequentially from the
control store, a microprogram counter (µPC) is used. Every time a new instruction is loaded into the
IR, the output of the block labeled "starting address generator" is loaded into the µPC. The µPC is then
automatically incremented by the clock, causing successive microinstructions to be read from the control
store. Hence, the control signals are delivered to various parts of the processor in the correct sequence.
In Section 7.4, we saw how the control signals required inside the processor
can be generated using a control step counter and a decoder/encoder circuit.
Now we discuss an alternative scheme, called microprogrammed control, in
which control signals are generated by a program similar to machine language
programs.
Consider how the idea of pipelining can be used in a computer. The processor executes a program by
fetching and executing instructions, one after the other. Let Fi and Ei refer to the fetch and execute steps
for instruction Ii. Execution of a program consists of a sequence of fetch and execute steps, as shown in
Figure 8.1a.
Now consider a computer that has two separate hardware units, one for fetching instructions and another
for executing them, as shown in Figure 8.1b. The instruction fetched by the fetch unit is deposited in an
intermediate storage buffer, B1. This buffer is needed to enable the execution unit to execute the
instruction while the fetch unit is fetching the next instruction. The results of execution are deposited in
the destination location specified by the instruction. For the purposes of this discussion, we assume that
both the source and the destination of the data operated on by the instructions are inside the block
labeled "Execution unit".
The computer is controlled by a clock whose period is such that the fetch and execute steps of any
instruction can each be completed in one clock cycle. Operation of the computer proceeds as in Figure
8.1c. In the first clock cycle, the fetch unit fetches an instruction I₁ (step F₁) and stores it in buffer B1 at
the end of the clock cycle. In the second clock cycle, the instruction fetch unit proceeds with the fetch
operation for instruction I₂ (step F₂). Meanwhile, the execution unit performs the operation specified by
instruction I₁, which is available to it in buffer B1 (step E₁). By the end of the second clock cycle, the
execution of instruction I1 is completed and instruction I₂ is available. Instruction I₂ is stored in B1,
replacing I₁, which is no longer needed. Step E₂ is performed by the execution unit during the third clock
cycle, while instruction I3 is being fetched by the fetch unit. In this manner, both the fetch and execute
units are kept busy all the time. If the pattern in Figure 8.1c can be sustained for a long time, the
completion rate of instruction execution will be twice that achievable by the sequential operation
depicted in Figure 8.1a.
In summary, the fetch and execute units in Figure 8.1b constitute a two-stage pipeline in which each
stage performs one step in processing an instruction. An inter-stage storage buffer, B1, is needed to hold
the information being passed from one stage to the next. New information is loaded into this buffer at
the end of each clock cycle. The processing of an instruction need not be divided into only two steps.
For example, a pipelined processor may process each instruction in four steps, as follows:
F   Fetch: read the instruction from the memory.
D   Decode: decode the instruction and fetch the source operand(s).
E   Execute: perform the operation specified by the instruction.
W   Write: store the result in the destination location.
The sequence of events for this case is shown in Figure 8.2a. Four instructions are in progress at any
given time. This means that four distinct hardware units are needed, as shown in Figure 8.2b. These
units must be capable of performing their tasks simultaneously and without interfering with one another.
Information is passed from one unit to the next through a storage buffer. As an instruction progresses
through the pipeline, all the information needed by the stages downstream must be passed along. For
example, during clock cycle 4, the information in the buffers is as follows:
Buffer B1 holds instruction I3, which was fetched in cycle 3 and is being decoded by the instruction-
decoding unit.
The speed of execution of programs is influenced by many factors. One way to improve performance is
to use faster circuit technology to implement the processor and the main memory. Another possibility is
to arrange the hardware so that more than one operation can be performed at the same time. In this way,
the number of operations performed per second is increased, even though the time needed to perform
any one operation is not changed.
Consider how the idea of pipelining can be used in a computer. The five-stage processor organization
and the corresponding data path allow instructions to be fetched and executed one at a time. It takes five
clock cycles to complete the execution of each instruction. Rather than wait until each instruction is
completed, instructions can be fetched and executed in a pipelined manner, as shown in Figure 6.1.
The five stages are labelled as Fetch, Decode, Compute, Memory, and Write. Instruction Ij is fetched in
the first cycle and moves through the remaining stages in the following cycles. In the second cycle,
instruction Ij+1 is fetched while instruction Ij is in the Decode stage where its operands are also read
from the register file. In the third cycle, instruction Ij+2 is fetched while instruction Ij+1 is in the
Decode stage and instruction Ij is in the Compute stage where an arithmetic or logic operation is
performed on its operands. Ideally, this overlapping pattern of execution would be possible for all
instructions. Although any one instruction takes five cycles to complete its execution, instructions are
completed at the rate of one per cycle.
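Under this ideal assumption, the total time for N instructions in a five-stage pipeline is 5 + (N − 1) cycles, as the short sketch below computes:

# Sketch of ideal pipeline timing: the first instruction takes 'stages' cycles,
# and each further instruction completes one cycle later.
def cycles(n_instructions, stages=5):
    return stages + (n_instructions - 1)

print(cycles(1), cycles(100))        # 5 and 104 cycles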