
ARM Embedded Systems Module-2 (21EC52)

Computer Organization & ARM Microcontroller (21EC52)

Prepared by
Dr. Nataraju A B
Assistant Professor, Department of ECE
Acharya Institute of Technology, Bangalore

Module-2, Memory System: Basic Concepts, Semiconductor RAM Memories,
Read-Only Memories, Speed, Size, and Cost, Cache Memories – Mapping
Functions, Replacement Algorithms, Performance Considerations.
Text book 1: Chapter 5 – 5.1 to 5.4, 5.5 (5.5.1, 5.5.2), 5.6
Basic Processing Unit: Some Fundamental Concepts, Execution of a Complete
Instruction, Multiple Bus Organization, Hard-wired Control, Microprogrammed
Control. Basic concepts of pipelining.
Text book 1: Chapter 7, Chapter 8 – 8.1


Memory System: Basic Concepts


Programs and the data they operate on are held in the memory of the computer. In this chapter,
we discuss how this vital part of the computer operates. By now, the reader appreciates that the
execution speed of programs is highly dependent on the speed with which instructions and data can be
transferred between the processor and the memory. It is also important to have sufficient memory to
facilitate execution of large programs having large amounts of data.

Ideally, the memory would be fast, large, and inexpensive. Unfortunately, it is impossible to
meet all three of these requirements simultaneously. Increased speed and size are achieved at increased
cost. Much work has gone into developing structures that improve the effective speed and size of the
memory, yet keep the cost reasonable.
The maximum size of the Main Memory (MM) that can be used in any computer is determined
by its addressing scheme. For example, a 16-bit computer that generates 16-bit addresses is capable of
addressing up to 2^16 = 64K memory locations. If a machine generates 32-bit addresses, it can access
up to 2^32 = 4G memory locations. This number represents the size of the address space of the computer.

If the smallest addressable unit of information is a memory word, the machine is called word-
addressable. If individual memory bytes are assigned distinct addresses, the computer is called byte-
addressable. Most commercial machines are byte-addressable. For example, in a byte-addressable
32-bit computer, each memory word contains 4 bytes. A possible word-address assignment would be:

Word Address    Byte Addresses
0               0  1  2  3
4               4  5  6  7
8               8  9  10 11
...             ...
With the above structure, a READ or WRITE may involve an entire memory word or only a byte. In
the case of a byte read, the other bytes of the word may also be read but are ignored by the CPU.
However, during a write cycle, the control circuitry of the MM must ensure that only the specified byte
is altered. In this case, the higher-order 30 bits specify the word and the lower-order 2 bits specify the
byte within the word.
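As a concrete illustration, the 30/2 split described above can be written out in C; this is only a sketch, and the function names are illustrative, not from the text:

    #include <stdint.h>

    /* Split a 32-bit byte address: the higher-order 30 bits identify the
       word, the lower-order 2 bits identify the byte within the word. */
    uint32_t word_number(uint32_t byte_addr) { return byte_addr >> 2;   }
    uint32_t byte_offset(uint32_t byte_addr) { return byte_addr & 0x3u; }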

CPU-Main Memory Connection – A block schematic:


From the system standpoint, the Main Memory (MM) unit can be viewed as a "black box". Data
transfer between the CPU and MM takes place through the use of two CPU registers, usually called MAR
(Memory Address Register) and MDR (Memory Data Register). If MAR is k bits long and MDR is n
bits long, then the MM unit may contain up to 2^k addressable locations, each n bits wide; the word
length is thus n bits. During a "memory cycle", n bits of data may be transferred between the MM and
the CPU. This transfer takes place over the processor bus, which has k address lines (address bus), n
data lines (data bus), and control lines such as Read, Write, Memory Function Completed (MFC), and
byte specifiers (control bus). For a read operation, the CPU loads the address into MAR, sets READ to
1, and sets other control signals if required. The data from the MM are loaded into MDR and MFC is
set to 1. For a write operation, MAR and MDR are suitably loaded by the CPU,
write is set to 1, and other control signals are set suitably. The MM control circuitry loads the data into
the appropriate location and sets MFC to 1. This organization is shown in the following block schematic.
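The read handshake just described can be modelled behaviourally. The sketch below is only a software illustration of the sequence (load MAR, assert Read, wait for MFC, take the data from MDR); the variable names stand in for the hardware registers and signals and are not a real API:

    #include <stdint.h>

    /* Behavioural sketch of a memory read cycle (illustrative only). */
    uint32_t MAR, MDR;          /* CPU registers                       */
    volatile int READ, MFC;     /* control-bus signals                 */

    uint32_t memory_read(uint32_t address)
    {
        MAR  = address;         /* CPU places the address in MAR       */
        READ = 1;               /* assert the Read control signal      */
        while (!MFC)            /* wait for Memory Function Completed  */
            ;                   /* (MM has loaded the data into MDR)   */
        READ = 0;
        return MDR;
    }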

Some Basic Concepts

Memory Access Time:-

It is a useful measure of the speed of the memory unit. It is the time that elapses between the initiation of
an operation and its completion (for example, the time between READ and MFC).

Memory Cycle Time :-


It is an important measure of the memory system. It is the minimum time delay required between the
initiations of two successive memory operations (for example, the time between two successive READ
operations). The cycle time is usually slightly longer than the access time.

Semiconductor RAM Memories


A memory unit is called a Random Access Memory if any location can be accessed for a READ or
WRITE operation in some fixed amount of time that is independent of the location's address. Main
memory units are of this type. This distinguishes them from serial, or partly serial, access storage
devices such as magnetic tapes and disks, which are used as secondary storage devices.

Cache Memory:-
The CPU of a computer can usually process instructions and data faster than they can be fetched from a
compatibly priced main memory unit. Thus the memory cycle time becomes the bottleneck in the system.
One way to reduce the memory access time is to use a cache memory. This is a small, fast memory
inserted between the larger, slower main memory and the CPU. It holds the currently active segments of
a program and their data. Because of the locality of address references, the CPU can, most of the time,
find the relevant information in the cache memory itself (a cache hit) and only infrequently needs access
to the main memory (a cache miss). With a suitable cache size, hit rates of over 90% are possible,
leading to a cost-effective increase in the performance of the system.

Memory Interleaving: -

This technique divides the memory system into a number of memory modules and arranges addressing
so that successive words in the address space are placed in different modules. When requests for
memory access involve consecutive addresses, the access will be to different modules. Since parallel
access to these modules is possible, the average rate of fetching words from the Main Memory can be
increased.
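With low-order interleaving, the module number comes from the least significant address bits, so consecutive addresses land in consecutive modules. A minimal sketch, assuming a power-of-two module count (the names are illustrative):

    /* Low-order interleaving across NUM_MODULES memory modules. */
    #define NUM_MODULES 4u   /* assumed; a power of two, so the module
                                number is just the low-order bits      */

    unsigned module_of(unsigned word_addr)      { return word_addr % NUM_MODULES; }
    unsigned addr_in_module(unsigned word_addr) { return word_addr / NUM_MODULES; }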

Virtual Memory: -
In a virtual memory System, the address generated by the CPU is referred to as a virtual or logical
address. The corresponding physical address can be different and the required mapping is implemented
by a special memory control unit, often called the memory management unit. The mapping function
itself may be changed during program execution according to system requirements.

A distinction is thus made between the logical (virtual) address space and the physical address
space: while the former can be as large as the addressing capability of the CPU, the actual physical
memory can be much smaller. Only the active portion of the virtual address space is mapped onto the
physical memory; the rest of the virtual address space is mapped onto the bulk storage device used.
If the addressed information is in the Main Memory (MM), it is accessed and execution proceeds.
Otherwise, an exception is generated, in response to which the memory management unit transfers a
contiguous block of words containing the desired word from the bulk storage unit to the MM,
displacing some block that is currently inactive. If the memory is managed in such a way that such
transfers are required relatively infrequently (i.e., the CPU will generally find the required information
in the MM), the virtual memory system can provide reasonably good performance and succeed in
creating the illusion of a large memory with a small, inexpensive MM.
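As an illustration of the mapping performed by the memory management unit, the sketch below assumes simple one-level page translation with 4-KB pages; the page table, present flags, and handle_page_fault() are hypothetical placeholders, not part of the text:

    #define PAGE_SHIFT 12                 /* assumed 4-KB pages             */
    #define PAGE_SIZE  (1u << PAGE_SHIFT)

    extern unsigned page_table[];         /* virtual page -> physical frame */
    extern int      page_present[];       /* 1 if the page is in the MM     */
    extern void     handle_page_fault(unsigned vpage); /* block brought in
                                             from bulk storage on a fault   */

    unsigned translate(unsigned virt_addr)
    {
        unsigned vpage  = virt_addr >> PAGE_SHIFT;
        unsigned offset = virt_addr & (PAGE_SIZE - 1);
        if (!page_present[vpage])
            handle_page_fault(vpage);     /* exception: MMU/OS intervenes   */
        return (page_table[vpage] << PAGE_SHIFT) | offset;
    }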

Internal Organization of Memory Chips


Memory chips are usually organized in the form of an array of cells, in which each cell is capable of storing
one bit of information. A row of cells constitutes a memory word, and the cells of a row are connected to a
common line referred to as the word line, and this line is driven by the address decoder on the chip. The
cells in each column are connected to a sense/write circuit by two lines known as bit lines. The sense/write
circuits are connected to the data input/output lines of the chip. During a READ operation, the Sense/Write
circuits sense, or read, the information stored in the cells selected by a word line and transmit this
information to the output lines. During a write operation, they receive input information and store it in the
cells of the selected word.

The following figure shows such an organization of a memory chip consisting of 16 words of 8 bits each,
which is usually referred to as a 16 x 8 organization.

The data input and the data output of each Sense/Write circuit are connected to a single bi-directional data
line in order to reduce the number of pins required. One control line, the R/W (Read/Write) input, is used to
specify the required operation, and another control line, the CS (Chip Select) input, is used to select a given
chip in a multichip memory system. This circuit requires 14 external connections and, allowing 2 pins for
power supply and ground connections, can be manufactured in the form of a 16-pin chip. It can store 16 x 8
= 128 bits.



Consider now a slightly larger memory circuit, one that has 1K (1024) memory cells. This circuit can be
organized as a 128 x 8 memory, requiring a total of 19 external connections. Alternatively, the same
number of cells can be organized into a 1K x 1 format. In this case, a 10-bit address is needed, but there
is only one data line, resulting in 15 external connections.
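To verify these counts: the 128 x 8 organization needs 7 address lines, 8 data lines, and 2 control lines (R/W and CS), i.e. 17 signal pins, plus 2 pins for power and ground, giving 19 connections; the 1K x 1 organization needs 10 address lines, 1 data line, and 2 control lines, i.e. 13 signal pins, plus 2 for power and ground, giving 15.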

The organization of the chip in the 1K x 1 format is shown below:


The 10-bit address is divided into two groups of 5 bits each to form the row and column addresses for
the cell array. A row address selects a row of 32 cells, all of which are accessed in parallel. One of
these, selected by the column address, is connected to the external data lines by the input and output
multiplexers. This structure can store 1024 bits and can be implemented in a 16-pin chip.
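In code form, the 5-bit row and column addresses are simply the two halves of the 10-bit address; a small sketch with illustrative names (which half forms the row is a convention of the figure):

    /* Split the 10-bit address of the 1K x 1 chip into 5-bit row and
       column addresses for the 32 x 32 cell array. */
    unsigned row_of(unsigned addr10) { return (addr10 >> 5) & 0x1Fu; }
    unsigned col_of(unsigned addr10) { return  addr10       & 0x1Fu; }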

A Typical Memory Cell


Semiconductor memories may be divided into bipolar and MOS types. They may be compared as
follows:
Characteristic        Bipolar    MOS
Power dissipation     More       Less
Bit density           Less       More
Impedance             Lower      Higher
Speed                 More       Less

Bipolar Memory Cell


A typical bipolar storage cell is shown below.
Two transistor inverters are connected to implement a basic flip-flop. The cell is connected to one word
line and two bit lines as shown. Normally, the bit lines are kept at about 1.6 V, and the word line is kept
at a slightly higher voltage of about 2.5 V. Under these conditions, the two diodes D1 and D2 are
reverse biased. Thus, because no current flows through the diodes, the cell is isolated from the bit lines.


Read Operation:-
Let us assume that Q1 on and Q2 off represents a 1. To read the contents of a given cell, the voltage on
the corresponding word line is reduced from 2.5 V to approximately 0.3 V. This causes one of the
diodes D1 or D2 to become forward biased, depending on whether transistor Q1 or Q2 is conducting.
As a result, current flows from bit line b when the cell is in the 1 state and from bit line b' when the cell
is in the 0 state. The Sense/Write circuit at the end of each pair of bit lines monitors the current on lines
b and b' and sets the output bit line accordingly.

Write Operation: -
While a given row of bits is selected, that is, while the voltage on the corresponding word line is 0.3 V,
the cells can be individually forced either to the 1 state by applying a positive voltage of about 3 V to
line b' or to the 0 state by similarly driving line b. This function is performed by the Sense/Write circuit.

MOS Memory Cell: -


MOS technology is used extensively in Main Memory units. As in the case of bipolar memories, many
MOS cell configurations are possible. The simplest of these is a flip-flop circuit. Two transistors T1 and
T2 are connected to implement a flip-flop. Active pull-up to VCC is provided through T3 and T4.
Transistors T5 and T6 act as switches that can be opened or closed under control of the word line. For a
read operation, when the cell is selected, T5 and T6 are closed and the corresponding flow of current
through b or b' is sensed by the Sense/Write circuits to set the output bit line accordingly. For a write
operation, the cell is selected and a positive voltage is applied on the appropriate bit line to store a 0 or
1. This configuration is shown below:

Fig. CMOS-based SRAM cell

Static Memories Vs Dynamic Memories:-


Bipolar as well as MOS memory cells that use a flip-flop-like structure to store information can maintain
the information as long as power is supplied to the cell. Such memories are called static memories. In
contrast, dynamic memories require not only a maintained power supply, but also a periodic "refresh" to
maintain the information stored in them. Dynamic memories can have very high bit densities and much
lower power consumption relative to static memories and are thus generally used to realize the main
memory unit.

Dynamic Memories:-
The basic idea of dynamic memory is that information is stored in the form of a charge on the capacitor.
An example of a dynamic memory cell is shown below: When the transistor T is turned on and an
appropriate voltage is applied to the bit line, information is stored in the cell, in the form of a known
amount of charge stored on the capacitor.

Fig. MOSFET-based DRAM cell

After the transistor is turned off, the capacitor begins to discharge. This is caused by the capacitor's
own leakage resistance and the very small amount of current that still flows through the transistor.
Hence the data can be read correctly only if it is read before the charge on the capacitor drops below
some threshold value. During a Read operation, the bit line is placed in a high-impedance state, the
transistor is turned on, and a sense circuit connected to the bit line is used to determine whether the
charge on the capacitor is above or below the threshold value. During such a Read, the charge on the
capacitor is restored to its original value; thus the cell is refreshed with every read operation.

Typical Organization of a Dynamic Memory Chip:-

A typical organization of a 64K x 1 dynamic memory chip is shown below. The cells are organized in the
form of a square array such that the high- and low-order 8 bits of the 16-bit address constitute the row and
column addresses of a cell, respectively. In order to reduce the number of pins needed for external
connections, the row and column addresses are multiplexed on 8 pins. To access a cell, the row address is
applied first. It is loaded into the row address latch in response to a single pulse on the Row Address Strobe
(RAS) input. This selects a row of cells. Next, the column address is applied to the address pins and is
loaded into the column address latch under the control of the Column Address Strobe (CAS) input; this
address selects the appropriate Sense/Write circuit. If the R/W signal indicates a Read operation, the output
of the selected circuit is transferred to the data output, DO. For a Write operation, the data on the DI line is
used to overwrite the cell selected.

It is important to note that the application of a row address causes all the cells on the corresponding row to
be read and refreshed during both Read and Write operations. To ensure that the contents of a dynamic
memory are maintained, each row of cells must be addressed periodically, typically once every two
milliseconds. A refresh circuit performs this function. Some dynamic memory chips incorporate a refresh
facility within the chips themselves and hence appear as static memories to the user; such chips are often
referred to as pseudo-static.

Another feature available on many dynamic memory chips is that once the row address is loaded, successive
locations can be accessed by loading only column addresses. Such block transfers can typically be carried
out at a rate that is double that for transfers involving random addresses. This feature is useful when
memory access follows a regular pattern, for example, in a graphics terminal. Because of their high density
and low cost, dynamic memories are widely used in the main memory units of computers. Commercially
available chips range in size from 1K to 4M bits or more, and come in various organizations such as 64K
x 1, 16K x 4, and 1M x 1.

Read-Only Memories


Both static and dynamic RAM chips are volatile, which means that they retain information only
while power is turned on. There are many applications requiring memory devices that retain the stored
information when power is turned off. For example, Chapter 4 describes the need to store a small
program in such a memory, to be used to start the bootstrap process of loading the operating system
from a hard disk into the main memory. The embedded applications described in Chapters 10 and 11 are
another important example. Many embedded applications do not use a hard disk and require non-volatile
memories to store their software.
Different types of non-volatile memories have been developed. Generally, their contents can be
read in the same way as for their volatile counterparts discussed above. But a special writing process is
needed to place the information into a non-volatile memory. Since its normal operation involves only
reading the stored data, a memory of this type is called a read-only memory (ROM).

A memory is called a read-only memory, or ROM, when information can be written into it only once at the time of
manufacture. Figure 8.11 shows a possible configuration for a ROM cell. A logic value 0 is stored in the cell if the transistor
is connected to ground at point P; otherwise, a 1 is stored. The bit line is connected through a resistor to the power supply. To
read the state of the cell, the word line is activated to close the transistor switch. As a result, the voltage on the bit line drops
to near zero if there is a connection between the transistor and ground. If there is no connection to ground, the bit line
remains at the high voltage level, indicating a 1. A sense circuit at the end of the bit line generates the proper output value.
The state of the connection to ground in each cell is determined when the chip is manufactured, using a mask with a pattern
that represents the information to be stored.

PROM

Some ROM designs allow the data to be loaded by the user, thus providing a programmable ROM
(PROM). Programmability is achieved by inserting a fuse at point P in Figure 8.11. Before it is
programmed, the memory contains all 0s. The user can insert 1s at the required locations by burning out
the fuses at these locations using high-current pulses. Of course, this process is irreversible.

PROMs provide flexibility and convenience not available with ROMs. The cost of preparing the masks
needed for storing a particular information pattern makes ROMs cost effective only in large volumes.
The alternative technology of PROMs provides a more convenient and considerably less expensive
approach, because memory chips can be programmed directly by the user.

EPROM

Another type of ROM chip provides an even higher level of convenience. It allows the stored data to be
erased and new data to be written into it. Such an erasable, reprogrammable ROM is usually called an
EPROM. It provides considerable flexibility during the development phase of digital systems. Since
EPROMs are capable of retaining stored information for a long time, they can be used in place of ROMs
or PROMs while software is being developed. In this way, memory changes and updates can be easily
made.

An EPROM cell has a structure similar to the ROM cell in Figure 8.11. However, the connection to
ground at point P is made through a special transistor. The transistor is normally turned off, creating an
open switch. It can be turned on by injecting charge into it that becomes trapped inside. Thus, an
EPROM cell can be used to construct a memory in the same way as the previously discussed ROM cell.
Erasure requires dissipating the charge trapped in the transistors that form the memory cells. This can be
done by exposing the chip to ultraviolet light, which erases the entire contents of the chip. To make this
possible, EPROM chips are mounted in packages that have transparent windows.

EEPROM

An EPROM must be physically removed from the circuit for reprogramming. Also, the stored
information cannot be erased selectively. The entire contents of the chip are erased when exposed to
ultraviolet light. Another type of erasable PROM can be programmed, erased, and reprogrammed
electrically. Such a chip is called an electrically erasable PROM, or EEPROM. It does not have to be
removed for erasure. Moreover, it is possible to erase the cell contents selectively. One disadvantage of
EEPROMs is that different voltages are needed for erasing, writing, and reading the stored data, which
increases circuit complexity. However, this disadvantage is outweighed by the many advantages of
EEPROMs. They have replaced EPROMs in practice.

Speed, Size, and Cost – Memory Hierarchy


We have already stated that an ideal memory would be fast, large, and inexpensive. From the discussion
in Section 8.2, it is clear that a very fast memory can be implemented using static RAM chips. But, these
chips are not suitable for implementing large memories, because their basic cells are larger and consume
more power than dynamic RAM cells.

Although dynamic memory units with gigabyte capacities can be implemented at a reasonable cost, the
affordable size is still small compared to the demands of large programs with voluminous data. A
solution is provided by using secondary storage, mainly magnetic disks, to provide the required memory
space. Disks are available at a reasonable cost, and they are used extensively in computer systems.
However, they are much slower than semiconductor memory units. In summary, a very large amount of
cost-effective storage can be provided by magnetic disks, and a large and considerably faster, yet
affordable, main memory can be built with dynamic RAM technology. This leaves the more expensive
and much faster static RAM technology to be used in smaller units where speed is of the essence, such
as in cache memories.

All of these different types of memory units are employed effectively in a computer system. The entire
computer memory can be viewed as the hierarchy depicted in Figure 8.14. The fastest access is to data
held in processor registers. Therefore, if we consider the registers to be part of the memory hierarchy,
then the processor registers are at the top in terms of speed of access. Of course, the registers provide
only a minuscule portion of the required memory.

At the next level of the hierarchy is a relatively small amount of memory that can be implemented
directly on the processor chip. This memory, called a processor cache, holds copies of the instructions
and data stored in a much larger memory that is provided externally. There are often two or more levels
of cache. A primary cache is always located on the processor chip. This cache is small and its access
time is comparable to that of processor registers. The primary cache is referred to as the level 1 (L1)
cache. A larger, and hence somewhat slower, secondary cache is placed between the primary cache and
the rest of the memory. It is referred to as the level 2 (L2) cache. Often, the L2 cache is also housed on
the processor chip.


Some computers have a level 3 (L3) cache of even larger size, in addition to the L1 and L2 caches. An
L3 cache, also implemented in SRAM technology, may or may not be on the same chip with the
processor and the L1 and L2 caches.

The next level in the hierarchy is the main memory. This is a large memory implemented using dynamic
memory components, typically assembled in memory modules such as DIMMs, as described in Section
8.2.5. The main memory is much larger but significantly slower than cache memories. In a computer
with a processor clock of 2 GHz or higher, the access time for the main memory can be as much as 100
times longer than the access time for the L1 cache.

Disk devices provide a very large amount of inexpensive memory, and they are widely used as
secondary storage in computer systems. They are very slow compared to the main memory. They
represent the bottom level in the memory hierarchy.

During program execution, the speed of memory access is of utmost importance. The key to managing
the operation of the hierarchical memory system in Figure 8.14 is to bring the instructions and data that
are about to be used as close to the processor as possible. This is the main purpose of using cache
memories, which we discuss next.


Cache Memories – Mapping Functions


The cache is a small and very fast memory, interposed between the processor and the main memory. Its purpose is to
make the main memory appear to the processor to be much faster than it actually is. The effectiveness of this approach is
based on a property of computer programs called locality of reference. Analysis of programs shows that most of their
execution time is spent in routines in which many instructions are executed repeatedly. These instructions may constitute a
simple loop, nested loops, or a few procedures that repeatedly call each other. The actual detailed pattern of instruction
sequencing is not important—the point is that many instructions in localized areas of the program are executed repeatedly
during some time period. This behaviour manifests itself in two ways: temporal and spatial. The first means that a recently
executed instruction is likely to be executed again very soon. The spatial aspect means that instructions close to a recently
executed instruction are also likely to be executed soon.

Conceptually, operation of a cache memory is very simple. The memory control circuitry is designed to take
advantage of the property of locality of reference. Temporal locality suggests that whenever an information item, instruction
or data, is first needed, this item should be brought into the cache, because it is likely to be needed again soon. Spatial
locality suggests that instead of fetching just one item from the main memory to the cache, it is useful to fetch several items
that are located at adjacent addresses as well. The term cache block refers to a set of contiguous address locations of some
size. Another term that is often used to refer to a cache block is a cache line.
Consider the arrangement in Figure 8.15. When the processor issues a Read request, the contents of a block of
memory words containing the location specified are transferred into the cache. Subsequently, when the program references
any of the locations in this block, the desired contents are read directly from the cache. Usually, the cache memory can store a
reasonable number of blocks at any given time, but this number is small compared to the total number of blocks in the main
memory. The correspondence between the main memory blocks and those in the cache is specified by a mapping function.
When the cache is full and a memory word (instruction or data) that is not in the cache is referenced, the cache control
hardware must decide which block should be removed to create space for the new block that contains the referenced word.
The collection of rules for making this decision constitutes the cache’s replacement algorithm.

Cache Hits
The processor does not need to know explicitly about the existence of the cache. It simply issues Read and Write
requests using addresses that refer to locations in the memory. The cache control circuitry determines whether the requested
word currently exists in the cache. If it does, the Read or Write operation is performed on the appropriate cache location. In
this case, a read or write hit is said to have occurred. The main memory is not involved when there is a cache hit in a Read
operation. For a Write operation, the system can proceed in one of two ways. In the first technique, called the write-through
protocol, both the cache location and the main memory location are updated. The second technique is to update only the
cache location and to mark the block containing it with an associated flag bit, often called the dirty or modified bit. The main
memory location of the word is updated later, when the block containing this marked word is removed from the cache to
make room for a new block. This technique is known as the write-back, or copy-back, protocol.

The write-through protocol is simpler than the write-back protocol, but it results in unnecessary Write operations in
the main memory when a given cache word is updated several times during its cache residency. The write-back protocol also
involves unnecessary Write operations, because all words of the block are eventually written back, even if only a single word
has been changed while the block was in the cache. The write-back protocol is used most often, to take advantage of the high
speed with which data blocks can be transferred to memory chips.

Cache Misses
A Read operation for a word that is not in the cache constitutes a Read miss. It causes the block of words containing
the requested word to be copied from the main memory into the cache. After the entire block is loaded into the cache, the
particular word requested is forwarded to the processor. Alternatively, this word may be sent to the processor as soon as it is
read from the main memory. The latter approach, which is called load-through, or early restart, reduces the processor’s
waiting time somewhat, at the expense of more complex circuitry.

When a Write miss occurs in a computer that uses the write-through protocol, the information is written directly into
the main memory. For the write-back protocol, the block containing the addressed word is first brought into the cache, and
then the desired word in the cache is overwritten with the new information.
Recall from Section 6.7 that resource limitations in a pipelined processor can cause instruction execution to stall for one or
more cycles. This can occur if a Load or Store instruction requests access to data in the memory at the same time that a
subsequent instruction is being fetched. When this happens, instruction fetch is delayed until the data access operation is
completed. To avoid stalling the pipeline, many processors use separate caches for instructions and data, making it possible
for the two operations to proceed in parallel.

There are several possible methods for determining where memory blocks are placed in the cache. It is instructive to
describe these methods using a specific small example. Consider a cache consisting of 128 blocks of 16 words each, for a
total of 2048 (2K) words, and assume that the main memory is addressable by a 16-bit address. The main memory has 64K
words, which we will view as 4K blocks of 16 words each. For simplicity, we have assumed that consecutive addresses refer
to consecutive words.

Direct Mapping
The simplest way to determine cache locations in which to store memory blocks is the direct-mapping technique. In
this technique, block j of the main memory maps onto block j modulo 128 of the cache, as depicted in Figure 8.16. Thus,
whenever one of the main memory blocks 0, 128, 256, . . . is loaded into the cache, it is stored in cache block 0. Blocks 1,
129, 257, . . . are stored in cache block 1, and so on. Since more than one memory block is mapped onto a given cache block
position, contention may arise for that position even when the cache is not full. For example, instructions of a program may
start in block 1 and continue in block 129, possibly after a branch. As this program is executed, both of these blocks must be
transferred to the block-1 position in the cache. Contention is resolved by allowing the new block to overwrite the currently
resident block.

With direct mapping, the replacement algorithm is trivial. Placement of a block in the cache is determined by its
memory address. The memory address can be divided into three fields, as shown in Figure 8.16. The low-order 4 bits select
one of 16 words in a block. When a new block enters the cache, the 7-bit cache block field determines the cache position in
which this block must be stored. The high-order 5 bits of the memory address of the block are stored in 5 tag bits associated
with its location in the cache. The tag bits identify which of the 32 main memory blocks mapped into this cache position is
currently resident in the cache. As execution proceeds, the 7-bit cache block field of each address generated by the processor
points to a particular block location in the cache. The high-order 5 bits of the address are compared with the tag bits
associated with that cache location. If they match, then the desired word is in that block of the cache. If there is no match,
then the block containing the required word must first be read from the main memory and loaded into the cache. The direct-
mapping technique is easy to implement, but it is not very flexible.
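The field extraction just described is easy to state in code. The sketch below uses the 5-bit tag / 7-bit block / 4-bit word split of this example; the names are illustrative:

    /* Address fields for the direct-mapped example:
       | 5-bit tag | 7-bit cache block | 4-bit word |  (16-bit address) */
    unsigned word_field (unsigned addr) { return  addr        & 0xFu;  }
    unsigned block_field(unsigned addr) { return (addr >> 4)  & 0x7Fu; }
    unsigned tag_field  (unsigned addr) { return (addr >> 11) & 0x1Fu; }

    /* Main memory block j maps onto cache block j modulo 128. */
    unsigned cache_position(unsigned mem_block) { return mem_block % 128u; }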


Associative Mapping
Figure 8.17 shows the most flexible mapping method, in which a main memory block can be placed into any cache
block position. In this case, 12 tag bits are required to identify a memory block when it is resident in the cache. The tag bits
of an address received from the processor are compared to the tag bits of each block of the cache to see if the desired block is
present. This is called the associative-mapping technique. It gives complete freedom in choosing the cache location in which
to place the memory block, resulting in a more efficient use of the space in the cache. When a new block is brought into the
cache, it replaces (ejects) an existing block only if the cache is full. In this case, we need an algorithm to select the block to
be replaced. Many replacement algorithms are possible, as we discuss in Section 8.6.2. The complexity of an associative
cache is higher than that of a direct-mapped cache, because of the need to search all 128 tag patterns to determine whether a
given block is in the cache. To avoid a long delay, the tags must be searched in parallel. A search of this kind is called an
associative search.
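In software terms, the lookup compares the 12-bit tag of the address against the tags of all 128 resident blocks. The loop below is only an illustration; the hardware performs all 128 comparisons in parallel:

    #define NUM_BLOCKS 128

    unsigned cache_tag[NUM_BLOCKS];   /* 12-bit tag of each resident block   */
    int      cache_valid[NUM_BLOCKS]; /* 1 if the block position is occupied */

    /* Return the cache block holding the addressed word, or -1 on a miss. */
    int associative_lookup(unsigned addr)
    {
        unsigned tag = (addr >> 4) & 0xFFFu;  /* high-order 12 bits          */
        for (int i = 0; i < NUM_BLOCKS; i++)  /* done in parallel in hardware */
            if (cache_valid[i] && cache_tag[i] == tag)
                return i;
        return -1;                            /* miss */
    }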


Set-Associative Mapping
Another approach is to use a combination of the direct- and associative-mapping techniques. The
blocks of the cache are grouped into sets, and the mapping allows a block of the main memory to reside
in any block of a specific set. Hence, the contention problem of the direct method is eased by having a
few choices for block placement. At the same time, the hardware cost is reduced by decreasing the size
of the associative search. An example of this set-associative-mapping technique is shown in Figure 8.18
for a cache with two blocks per set. In this case, memory blocks 0, 64, 128, . . . , 4032 map into cache
set 0, and they can occupy either of the two block positions within this set. Having 64 sets means that
the 6-bit set field of the address determines which set of the cache might contain the desired block. The
tag field of the address must then be associatively compared to the tags of the two blocks of the set to
check if the desired block is present. This two-way associative search is simple to implement.

The number of blocks per set is a parameter that can be selected to suit the requirements of a
particular computer. For the main memory and cache sizes in Figure 8.18, four blocks per set can be
accommodated by a 5-bit set field, eight blocks per set by a 4-bit set field, and so on. The extreme
condition of 128 blocks per set requires no set bits and corresponds to the fully-associative technique,
with 12 tag bits. The other extreme of one block per set is the direct-mapping method. A cache that has k
blocks per set is referred to as a k-way set-associative cache.
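For the two-way cache of Figure 8.18, the 16-bit address therefore splits into a 6-bit tag (16 − 6 − 4 = 6 bits), a 6-bit set field, and a 4-bit word field; a sketch of the extraction, with illustrative names:

    /* Address fields for the 2-way set-associative example:
       | 6-bit tag | 6-bit set | 4-bit word |  (16-bit address) */
    unsigned sa_word(unsigned addr) { return  addr        & 0xFu;  }
    unsigned sa_set (unsigned addr) { return (addr >> 4)  & 0x3Fu; }
    unsigned sa_tag (unsigned addr) { return (addr >> 10) & 0x3Fu; }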


Cache Memories – Replacement Algorithms


In a direct-mapped cache, the position of each block is predetermined by its address; hence, the
replacement strategy is trivial. In associative and set-associative caches there exists some flexibility.
When a new block is to be brought into the cache and all the positions that it may occupy are full, the
cache controller must decide which of the old blocks to overwrite. This is an important issue, because
the decision can be a strong determining factor in system performance. In general, the objective is to
keep blocks in the cache that are likely to be referenced in the near future. But, it is not easy to
determine which blocks are about to be referenced. The property of locality of reference in programs
gives a clue to a reasonable strategy. Because program execution usually stays in localized areas for
reasonable periods of time, there is a high probability that the blocks that have been referenced recently
will be referenced again soon. Therefore, when a block is to be overwritten, it is sensible to overwrite
the one that has gone the longest time without being referenced. This block is called the least recently
used (LRU) block, and the technique is called the LRU replacement algorithm.
To use the LRU algorithm, the cache controller must track references to all blocks as
computation proceeds. Suppose it is required to track the LRU block of a four-block set in a set-
associative cache. A 2-bit counter can be used for each block. When a hit occurs, the counter of the
block that is referenced is set to 0. Counters with values originally lower than the referenced one are
incremented by one, and all others remain unchanged. When a miss occurs and the set is not full, the
counter associated with the new block loaded from the main memory is set to 0, and the values of all
other counters are increased by one. When a miss occurs and the set is full, the block with the counter
value 3 is removed, the new block is put in its place, and its counter is set to 0. The other three block
counters are incremented by one. It can be easily verified that the counter values of occupied blocks are
always distinct.
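The counter discipline described above can be written out directly. The sketch below handles one four-block set for the hit and full-set-miss cases given in the text (the miss-with-free-block case is analogous and omitted):

    #define WAYS 4

    /* 2-bit LRU counters for one set; occupied blocks always hold
       distinct values 0 (most recently used) .. 3 (least recently used). */
    void lru_on_hit(unsigned ctr[WAYS], int hit)
    {
        unsigned old = ctr[hit];
        for (int i = 0; i < WAYS; i++)
            if (ctr[i] < old)   /* only counters originally below the hit's */
                ctr[i]++;
        ctr[hit] = 0;           /* referenced block becomes most recent     */
    }

    int lru_on_full_miss(unsigned ctr[WAYS])
    {
        int victim = 0;
        for (int i = 0; i < WAYS; i++) {
            if (ctr[i] == 3) victim = i;  /* replace the LRU block          */
            ctr[i]++;                     /* the other three move toward LRU */
        }
        ctr[victim] = 0;                  /* new block is most recent       */
        return victim;
    }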
The LRU algorithm has been used extensively. Although it performs well for many access
patterns, it can lead to poor performance in some cases. For example, it produces disappointing results
when accesses are made to sequential elements of an array that is slightly too large to fit into the cache
(see Section 8.6.3 and Problem 8.11). Performance of the LRU algorithm can be improved by
introducing a small amount of randomness in deciding which block to replace.
Several other replacement algorithms are also used in practice. An intuitively reasonable rule
would be to remove the “oldest” block from a full set when a new block must be brought in. However,
because this algorithm does not take into account the recent pattern of access to blocks in the cache, it is
generally not as effective as the LRU algorithm in choosing the best blocks to remove. The simplest
algorithm is to randomly choose the block to be overwritten. Interestingly enough, this simple algorithm
has been found to be quite effective in practice.

Performance Considerations.
Two key factors in the commercial success of a computer are performance and cost; the best
possible performance for a given cost is the objective. A common measure of success is the
price/performance ratio. Performance depends on how fast machine instructions can be brought into the
processor and how fast they can be executed. Chapter 6 shows how pipelining increases the speed of
program execution. In this chapter, we focus on the memory subsystem.

The memory hierarchy described in Section 8.5 results from the quest for the best
price/performance ratio. The main purpose of this hierarchy is to create a memory that the processor
sees as having a short access time and a large capacity. When a cache is used, the processor is able to
access instructions and data more quickly when the data from the referenced memory locations are in the
cache. Therefore, the extent to which caches improve performance is dependent on how frequently the
requested instructions and data are found in the cache. In this section, we examine this issue
quantitatively.

Hit Rate and Miss Penalty

An excellent indicator of the effectiveness of a particular implementation of the memory hierarchy is the
success rate in accessing information at various levels of the hierarchy. Recall that a successful access to
data in a cache is called a hit. The number of hits stated as a fraction of all attempted accesses is called
the hit rate, and the miss rate is the number of misses stated as a fraction of attempted accesses.

Ideally, the entire memory hierarchy would appear to the processor as a single memory unit that has the
access time of the cache on the processor chip and the size of the magnetic disk. How close we get to
this ideal depends largely on the hit rate at different levels of the hierarchy. High hit rates well over 0.9
are essential for high-performance computers.

Performance is adversely affected by the actions that need to be taken when a miss occurs. A
performance penalty is incurred because of the extra time needed to bring a block of data from a slower
unit in the memory hierarchy to a faster unit. During that period, the processor is stalled waiting for
instructions or data. The waiting time depends on the details of the operation of the cache. For example,
it depends on whether or not the load-through approach is used. We refer to the total access time seen by
the processor when a miss occurs as the miss penalty.
Consider a system with only one level of cache. In this case, the miss penalty consists almost
entirely of the time to access a block of data in the main memory. Let h be the hit rate, M the miss
penalty, and C the time to access information in the cache. Thus, the average access time experienced by
the processor is

tavg = hC + (1 − h)M
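For example, with values assumed purely for illustration, h = 0.95, C = 1 cycle, and M = 100 cycles give tavg = 0.95 x 1 + 0.05 x 100 = 5.95 cycles; even a 5% miss rate makes the average access almost six times the cache access time, which is why high hit rates are essential.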

Basic Processing Unit: Some Fundamental Concepts


A typical computing task consists of a series of operations specified by a sequence of machine-language
instructions that constitute a program. The processor fetches one instruction at a time and performs the
operation specified. Instructions are fetched from successive memory locations until a branch or a jump
instruction is encountered. The processor uses the program counter, PC, to keep track of the address of
the next instruction to be fetched and executed. After fetching an instruction, the contents of the PC are
updated to point to the next instruction in sequence. A branch instruction may cause a different value to
be loaded into the PC.

When an instruction is fetched, it is placed in the instruction register, IR, from where it is interpreted, or
decoded, by the processor’s control circuitry. The IR holds the instruction until its execution is
completed.

Consider a 32-bit computer in which each instruction is contained in one word in the memory, as in
RISC-style instruction set architecture. To execute an instruction, the processor has to perform the
following steps:

1. Fetch the contents of the memory location pointed to by the PC. The contents of this location are
the instruction to be executed; hence they are loaded into the IR. In register transfer notation, the
required action is
IR←[[PC]]
2. Increment the PC to point to the next instruction. Assuming that the memory is byte addressable,
the PC is incremented by 4; that is
PC←[PC] + 4
3. Carry out the operation specified by the instruction in the IR.
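These three steps amount to the classic fetch-execute loop. A behavioural sketch in C, where mem[] and execute() are hypothetical placeholders for the memory and the execution phase:

    #include <stdint.h>

    extern uint32_t mem[];        /* word-addressed view of byte memory     */
    extern void execute(uint32_t ir, uint32_t *pc); /* step 3; a branch may
                                     load a different value into the PC     */

    void fetch_execute(uint32_t pc)
    {
        for (;;) {
            uint32_t ir = mem[pc / 4];  /* 1. IR <- [[PC]]                  */
            pc += 4;                    /* 2. PC <- [PC] + 4                */
            execute(ir, &pc);           /* 3. carry out the operation in IR */
        }
    }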


Fetching an instruction and loading it into the IR is usually referred to as the instruction fetch phase.
Performing the operation specified in the instruction constitutes the instruction execution phase.

With few exceptions, the operation specified by an instruction can be carried out by performing one or
more of the following actions:
• Read the contents of a given memory location and load them into a processor register.
• Read data from one or more processor registers.
• Perform an arithmetic or logic operation and place the result into a processor register.
• Store data from a processor register into a given memory location.


Fundamental concepts - Register Transfers


Instruction execution involves a sequence of steps in which data are transferred from one register to
another. For each register, two control signals are used to place the contents of that register on the bus or
to load the data on the bus into the register. This is represented symbolically in Figure 7.2. The input and
output of register Ri are connected to the bus via switches controlled by the signals Riin and Riout,
respectively. When Riin is set to 1, the data on the bus are loaded into Ri. Similarly, when Riout is set to
1, the contents of register Ri are placed on the bus. While Riout is equal to 0, the bus can be used for
transferring data from other registers.

Suppose that we wish to transfer the contents of register R1 to register R4. This can be accomplished as
follows:

Enable the output of register R1 by setting R1out to 1. This places the contents of R1 on the processor
bus.

Enable the input of register R4 by setting R4in to 1. This loads the data from the processor bus into
register R4.

All operations and data transfers within the processor take place within time periods defined by the
processor clock. The control signals that govern a particular transfer are asserted at the start of the clock
cycle. In our example, R1out and R4in are set to 1. The registers consist of edge-triggered flip-flops.
Hence, at the next active edge of the clock, the flip-flops that constitute R4 will load the data present at
their inputs. At the same time, the control signals R1out and R4in will return to 0. We will use this
simple model of the timing of data transfers for the rest of this chapter. However, we should point out
that other schemes are possible. For example, data transfers may use both the rising and falling edges of
the clock. Also, when edge-triggered flip-flops are not used, two or more clock signals may be needed to
guarantee proper transfer of data. This is known as multiphase clocking.

An implementation for one bit of register Ri is shown in Figure 7.3 as an example. A two-input
multiplexer is used to select the data applied to the input of an edge-triggered D flip-flop. When the
control input Riin is equal to 1, the multiplexer selects the data on the bus. This data will be loaded into
the flip-flop at the rising edge of the clock. When Riin is equal to 0, the multiplexer feeds back the value
currently stored in the flip-flop.

The Q output of the flip-flop is connected to the bus via a tri-state gate. When Riout is equal to 0, the
gate's output is in the high-impedance (electrically disconnected) state. This corresponds to the open-
circuit state of a switch. When Riout is equal to 1, the gate drives the bus to 0 or 1, depending on the
value of Q.

Fundamental concepts – Performing an ALU operation


The ALU is a combinational circuit that has no internal storage. It performs arithmetic and logic
operations on the two operands applied to its A and B inputs. In Figures 7.1 and 7.2, one of the operands
is the output of the multiplexer MUX and the other operand is obtained directly from the bus. The result
produced by the ALU is stored temporarily in register Z. Therefore, a sequence of operations to add the
contents of register R1 to those of register R2 and store the result in register R3 is

R3 ← [R1] + [R2]

1. R1out, Yin
2. R2out, SelectY, Add, Zin
3. Zout, R3in

The signals whose names are given in any step are activated for the duration of the clock cycle
corresponding to that step. All other signals are inactive. Hence, in step 1, the output of register R1 and
the input of register Y are enabled, causing the contents of R1 to be transferred over the bus to Y. In step
2, the multiplexer's Select signal is set to SelectY, causing the multiplexer to gate the contents of register
Y to input A of the ALU. At the same time, the contents of register R2 are gated onto the bus and,
hence, to input B. The function performed by the ALU depends on the signals applied to its control
lines. In this case, the Add line is set to 1, causing the output of the ALU to be the sum of the two
numbers at inputs A and B. This sum is loaded into register Z because its input control signal is
activated. In step 3, the contents of register Z are transferred to the destination register, R3. This last
transfer cannot be carried out during step 2, because only one register output can be connected to the bus
during any clock cycle.

In this introductory discussion, we assume that there is a dedicated signal for each function to be
performed. For example, we assume that there are separate control signals to specify individual ALU
operations, such as Add, Subtract, XOR, and so on. In reality, some degree of encoding is likely to be
used. For example, if the ALU can perform eight different operations, three control signals would
suffice to specify the required operation.

Fundamental concepts – Fetching a word from Memory


To fetch a word of information from memory, the processor has to specify the address of the memory
location where this information is stored and request a Read operation. This applies whether the
information to be fetched represents an instruction in a program or an operand specified by an
instruction. The processor transfers the required address to the MAR, whose output is connected to the
address lines of the memory bus. At the same time, the processor uses the control lines of the memory
bus to indicate that a Read operation is needed. When the requested data are received from the memory
they are stored in register MDR, from where they can be transferred to other registers in the processor.

The connections for register MDR are illustrated in Figure 7.4. It has four control signals: MDRin and
MDRout control the connection to the internal bus, and MDRinE and MDRoutE control the connection to
the external bus. The circuit in Figure 7.3 is easily modified to provide the additional connections. A
three-input multiplexer can be used, with the memory bus data line connected to the third input. This
input is selected when MDRinE = 1. A second tri-state gate, controlled by MDRoutE, can be used to
connect the output of the flip-flop to the memory bus.

During memory Read and Write operations, the timing of internal processor operations must be
coordinated with the response of the addressed device on the memory bus. The processor completes one
internal data transfer in one clock cycle. The speed of operation of the addressed device, on the other
hand, varies with the device. We saw in Chapter 5 that modern processors include a cache memory on
the same chip as the processor. Typically, a cache will respond to a memory read request in one clock
cycle. However, when a cache miss occurs, the request is forwarded to the main memory, which
introduces a delay of several clock cycles. A read or write request may also be intended for a register in
a memory-mapped I/O device. Such I/O registers are not cached, so their accesses always take a number
of clock cycles.

To accommodate the variability in response time, the processor waits until it receives an indication
that the requested Read operation has been completed. We will assume that a control signal called
Memory-Function-Completed (MFC) is used for this purpose. The addressed device sets this signal to
1 to indicate that the contents of the specified location have been read and are available on the data
lines of the memory bus. (We encountered several examples of such a signal in conjunction with the
buses discussed in Chapter 4, such as Slave-ready in Figure 4.25 and TRDY# in Figure 4.41.) As an
example of a read operation, consider the instruction Move (R1),R2. The actions needed to execute
this instruction are:
1. MAR← [R1]
2. Start a Read operation on the memory bus
3. Wait for the MFC response from the memory
4. Load MDR from the memory bus
5. R2 ← [MDR]
These actions may be carried out as separate steps, but some can be combined into a single step. Each
action can be completed in one clock cycle, except action 3 which requires one or more clock cycles,
depending on the speed of the addressed device.

For simplicity, let us assume that the output of MAR is enabled all the time. Thus, the contents of
MAR are always available on the address lines of the memory bus. This is the case when the processor
is the bus master. When a new address is loaded into MAR, it will appear on the memory bus at the
beginning of the next clock cycle, as shown in Figure 7.5. A Read control signal is activated at the
same time MAR is loaded. This signal will cause the bus interface circuit to send a read command,
MR, on the bus. With this arrangement, we have combined actions 1 and 2 above into a single control
step. Actions 3 and 4 can also be combined by activating control signal MDRinE while waiting for a
response from the memory. Thus the data received from the memory are loaded into MDR at the end
of the clock cycle in which the MFC signal is received. In the next clock cycle, MDRout is activated
to transfer the data to register R2. This means that the memory read operation requires three steps,
which can be described by the signals being activated as follows:
1. R1out, MARin, Read
2. MDRinE, WMFC
3. MDRout, R2in
where WMFC is the control signal that causes the processor's control circuitry to wait for the arrival
of the MFC signal.
Figure 7.5 shows that MDRinE is set to 1 for exactly the same period as the read command, MR.
Hence, in subsequent discussion, we will not specify the value of MDRinE explicitly, with the
understanding that it is always equal to MR.

Fundamental concepts – Storing a word in Memory


Writing a word into a memory location follows a similar procedure. The desired address is loaded into
MAR. Then, the data to be written are loaded into MDR, and a Write command is issued. Hence,
executing the instruction Move R2,(R1) requires the following sequence:

1. R1out , MARin

2. R2out , MDRin , Write

3. MDRoutE, WMFC

As in the case of the read operation, the Write control signal causes the memory bus interface hardware
to issue a Write command on the memory bus. The processor remains in step 3 until the memory
operation is completed and an MFC response is received.
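
The write sequence fits the same mold. In this hedged sketch the dictionary assignment in step 3 stands in for the Write command on the bus and the wait for MFC:

# Behavioural sketch of the three-step write for Move R2,(R1).
regs = {"R1": 0x1000, "R2": 42}
memory = {}
MAR = MDR = 0

def step1():                   # R1out, MARin
    global MAR
    MAR = regs["R1"]

def step2():                   # R2out, MDRin, Write
    global MDR
    MDR = regs["R2"]

def step3():                   # MDRoutE, WMFC
    memory[MAR] = MDR          # processor stays here until MFC = 1

for step in (step1, step2, step3):
    step()
print(memory)                  # -> {4096: 42}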

Execution of a Complete Instruction,


Let us now put together the sequence of elementary operations required to execute one instruction.
Consider the instruction

Add (R3), R1

which adds the contents of a memory location pointed to by R3 to register R1. Executing this instruction
requires the following actions:

1. Fetch the instruction.
2. Fetch the first operand (the contents of the memory location pointed to by R3).
3. Perform the addition.
4. Load the result into R1.

Figure 7.6 gives the sequence of control steps required to perform these operations for the single-bus
architecture of Figure 7.1. Instruction execution proceeds as follows. In step 1, the instruction fetch
operation is initiated by loading the contents of the PC into the MAR and sending a Read request to the
memory. The Select signal is set to Select4, which causes the multiplexer MUX to select the constant 4.
This value is added to the operand at input B, which is the contents of the PC, and the result is stored in
register Z. The updated value is moved from register Z back into the PC during step 2, while the
processor waits for the memory to respond. In step 3, the word fetched from the memory is loaded into the IR.

Steps 1 through 3 constitute the instruction fetch phase, which is the same for all instructions. The
instruction decoding circuit interprets the contents of the IR at the beginning of step 4. This enables the
control circuitry to activate the control signals for steps 4 through 7, which constitute the execution
phase. The contents of register R3 are transferred to the MAR in step 4, and a memory read operation is
initiated. Then the contents of R1 are transferred to register Y in step 5, to prepare for the addition
operation. When the Read operation is completed, the memory operand is available in register MDR,
and the addition operation is performed in step 6. The contents of MDR are gated to the bus, and thus
also to the B input of the ALU, and register Y is selected as the second input to the ALU by choosing
SelectY. The sum is stored in register Z, then transferred to R1 in step 7. The End signal causes a new
instruction fetch cycle to begin by returning to step 1.
This discussion accounts for all control signals in Figure 7.6 except Yin in step 2. There is no need to
copy the updated contents of PC into register Y when executing the Add instruction. But, in Branch
instructions the updated value of the PC is needed to compute the Branch target address. To speed up the
execution of Branch instructions, this value is copied into register Y in step 2. Since step 2 is part of the
fetch phase, the same action will be performed for all instructions. This does not cause any harm
because register Y is not used for any other purpose at that time.

Step Action
1 PCout, MARin, Read, Select4, Add, Zin
2 Zout, PCin, Yin, WMFC
3 MDRout, IRin
4 R3out, MARin, Read
5 R1out, Yin, WMFC
6 MDRout, SelectY, Add, Zin
7 Zout, R1in, End
Figure 7.6 Control sequence for execution of the instruction Add (R3),R1.
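
To see what each control step accomplishes, the sketch below walks through the seven steps over toy register and memory contents. It is a behavioural model only: the bus, the ALU multiplexer, and the MFC handshake are abstracted into plain assignments, and the initial values are invented for the example.

# Walk-through of the Figure 7.6 control sequence for Add (R3),R1.
regs = {"PC": 100, "R1": 5, "R3": 0x200,
        "IR": None, "MAR": 0, "MDR": 0, "Y": 0, "Z": 0}
memory = {100: "Add (R3),R1", 0x200: 7}

regs["MAR"] = regs["PC"]               # 1: PCout, MARin, Read,
regs["Z"] = regs["PC"] + 4             #    Select4, Add, Zin
regs["PC"] = regs["Z"]                 # 2: Zout, PCin, Yin, WMFC
regs["Y"] = regs["Z"]
regs["MDR"] = memory[regs["MAR"]]      #    (MFC arrives; MDR loaded)
regs["IR"] = regs["MDR"]               # 3: MDRout, IRin
regs["MAR"] = regs["R3"]               # 4: R3out, MARin, Read
regs["Y"] = regs["R1"]                 # 5: R1out, Yin, WMFC
regs["MDR"] = memory[regs["MAR"]]      #    (second memory response)
regs["Z"] = regs["MDR"] + regs["Y"]    # 6: MDRout, SelectY, Add, Zin
regs["R1"] = regs["Z"]                 # 7: Zout, R1in, End
print(regs["R1"])                      # -> 12

Note how step 2 copies the updated PC into Y "for free"; step 5 then overwrites Y, exactly as explained above.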

Multiple Bus Organization:


We used the simple single-bus structure of Figure 7.1 to illustrate the basic ideas. The resulting control
sequences in Figures 7.6 and 7.7 are quite long because only one data item can be transferred over the
bus in a clock cycle. To reduce the number of steps needed, most commercial processors provide
multiple internal paths that enable several transfers to take place in parallel.

Figure 7.8 depicts a three-bus structure used to connect the registers and the ALU of a processor. All
general-purpose registers are combined into a single block called the register file. In VLSI technology,
the most efficient way to implement a number of registers is in the form of an array of memory cells
similar to those used in the implementation of random-access memories (RAMs) described in Chapter 5.
The register file in Figure 7.8 is said to have three ports. There are two outputs, allowing the contents of
two different registers to be accessed simultaneously and have their contents placed on buses A and B.
The third port allows the data on bus C to be loaded into a third register during the same clock cycle.

Buses A and B are used to transfer the source operands to the A and B inputs of the ALU, where an
arithmetic or logic operation may be performed. The result is transferred to the destination over bus C. If
needed, the ALU may simply pass one of its two input operands unmodified to bus C. We will call the
ALU control signals for such an operation R-A or R-B. The three-bus arrangement obviates the need for
registers Y and Z in Figure 7.1.

A second feature in Figure 7.8 is the introduction of the Incrementor unit, which is used to increment the
PC by 4. Using the Incrementor eliminates the need to add 4 to the PC using the main ALU, as was done
in Figures 7.6 and 7.7. The source for the constant 4 at the ALU input multiplexer is still useful. It can
be used to increment other addresses, such as the memory addresses in Load Multiple (LDM) and Store
Multiple (STM) instructions.


Consider the three-operand instruction

Add R4,R5,R6

The control sequence for executing this instruction is given in Figure 7.9. In step 1, the contents of the
PC are passed through the ALU, using the R-B control signal, and loaded into the MAR to start a
memory read operation. At the same time the PC is incremented by 4. Note that the value loaded into
MAR is the original contents of the PC. The incremented value is loaded into the PC at the end of the
clock cycle and will not affect the contents of MAR. In step 2, the processor waits for MFC and loads
the data received into MDR, then transfers them to IR in step 3. Finally, the execution phase of the
instruction requires only one control step to complete, step 4.
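
The signal names for these four steps are not listed in the text above. The following is a plausible reconstruction in the style of Figure 7.6; the exact labels (IncPC, MDRoutB, R4outA, and so on) are assumptions inferred from the description of the three-bus datapath:

# Assumed control sequence for Add R4,R5,R6 on the three-bus datapath.
control_sequence = [
    {"PCout", "R-B", "MARin", "Read", "IncPC"},  # 1: fetch; PC <- PC + 4
    {"WMFC"},                                    # 2: wait for the memory
    {"MDRoutB", "R-B", "IRin"},                  # 3: load the IR
    {"R4outA", "R5outB", "SelectA",              # 4: R6 <- R4 + R5
     "Add", "R6in", "End"},
]
for step, signals in enumerate(control_sequence, 1):
    print(step, ", ".join(sorted(signals)))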


By providing more paths for data transfer, a significant reduction in the number of clock cycles needed to
execute an instruction is achieved.

Hard-wired Control,
To execute instructions, the processor must have some means of generating the control signals needed
in the proper sequence. Computer designers use a wide variety of techniques to solve this problem. The
approaches used fall into one of two categories: hardwired control and microprogrammed control. We
discuss each of these techniques in detail, starting with hardwired control in this section.

Consider the sequence of control signals given in Figure 7.6. Each step in this sequence is completed in
one clock period. A counter may be used to keep track of the control steps, as shown in Figure 7.10.
Each state, or count, of this counter corresponds to one control step. The required control signals are
determined by the following information:

• Contents of the control step counter
• Contents of the instruction register
• Contents of the condition code flags
• External input signals, such as MFC and interrupt requests

To gain insight into the structure of the control unit, we start with a simplified view of the hardware
involved. The decoder/encoder block in Figure 7.10 is a combinational circuit that generates the required
control outputs, depending on the state of all its inputs. By separating the decoding and encoding
functions, we obtain the more detailed block diagram in Figure 7.11. The step decoder provides a
separate signal line for each step, or time slot, in the control sequence. Similarly, the output of the
instruction decoder consists of a separate line for each machine instruction. For any instruction loaded in
the IR, one of the output lines INS1 through INSm is set to 1, and all other lines are set to 0. (For design
details of decoders, refer to Appendix A.) The input signals to the encoder block in Figure 7.11 are
combined to generate the individual control signals Yin, PCout, Add, End, and so on. An example of
how the encoder generates the Zin control signal for the processor organization in Figure 7.1 is given in
Figure 7.12. This circuit implements the logic function

Zin = T1 + T6 · ADD + T4 · BR + …    [7.1]

This signal is asserted during time slot T1 for all instructions, during T6 for an Add instruction, during
T4 for an unconditional Branch instruction, and so on. The logic function for Zin is derived from the
control sequences in Figures 7.6 and 7.7.
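
Expressed as executable logic, Equation 7.1 is simply an OR of product terms over the step-decoder outputs (T1, T2, ...) and the instruction-decoder outputs. A minimal sketch, modeling only the terms written out in the equation:

# Encoder logic for Zin (Equation 7.1): Zin = T1 + T6.ADD + T4.BR + ...
def zin(T, ins):
    # T   : dict of step-decoder outputs, e.g. {6: True}
    # ins : dict of instruction-decoder outputs, e.g. {"ADD": True}
    return (T.get(1, False)
            or (T.get(6, False) and ins.get("ADD", False))
            or (T.get(4, False) and ins.get("BR", False)))

print(zin({1: True}, {}))              # True: T1, for all instructions
print(zin({6: True}, {"ADD": True}))   # True: T6 of an Add
print(zin({6: True}, {"BR": True}))    # False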

Micro programmed Control.

In Section 7.4, we saw how the control signals required inside the processor can be generated using a
control step counter and a decoder/encoder circuit. We now discuss an alternative scheme, called
microprogrammed control, in which control signals are generated by a program similar to machine
language programs.

First, we introduce some common terms. A control word (CW) is a word whose individual bits represent
the various control signals in Figure 7.11. Each of the control steps in the control sequence of an
instruction defines a unique combination of 1s and 0s in the CW. The CWs corresponding to the 7 steps
of Figure 7.6 are shown in Figure 7.15. We have assumed that SelectY is represented by Select = 0 and
Select4 by Select = 1. A sequence of CWs corresponding to the control sequence of a machine
instruction constitutes the microroutine for that instruction, and the individual control words in this
microroutine are referred to as microinstructions.

The microroutines for all instructions in the instruction set of a computer are stored in a special memory
called the control store. The control unit can generate the control signals for any instruction by
sequentially reading the CWs of the corresponding microroutine from the control store. This suggests
organizing the control unit as shown in Figure 7.16. To read the control words sequentially from the
control store, a microprogram counter (μPC) is used. Every time a new instruction is loaded into the
IR, the output of the block labeled "starting address generator" is loaded into the μPC. The μPC is then
automatically incremented by the clock, causing successive microinstructions to be read from the control
store. Hence, the control signals are delivered to various parts of the processor in the correct sequence.
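
A minimal sketch of this organization follows. The control store maps microinstruction addresses to control words, and the μPC is either incremented or reloaded from the starting-address generator once a new instruction has entered the IR; all addresses and the microroutine layout here are invented for illustration.

# Toy microprogrammed control unit.  The control words are those of
# Figure 7.6; the control-store addresses are made-up values.
control_store = {
    0: {"PCout", "MARin", "Read", "Select4", "Add", "Zin"},
    1: {"Zout", "PCin", "Yin", "WMFC"},
    2: {"MDRout", "IRin"},                 # instruction now in the IR
    # ... microroutines for the individual instructions follow ...
    25: {"R3out", "MARin", "Read"},
    26: {"R1out", "Yin", "WMFC"},
    27: {"MDRout", "SelectY", "Add", "Zin"},
    28: {"Zout", "R1in", "End"},
}

def starting_address(opcode):
    return {"Add": 25}[opcode]             # stand-in for the generator

uPC = 0                                    # fetch phase begins at 0
while True:
    cw = control_store[uPC]
    print(uPC, sorted(cw))                 # deliver signals to datapath
    if "End" in cw:
        break                              # return to instruction fetch
    uPC = starting_address("Add") if "IRin" in cw else uPC + 1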

Basic concepts of pipelining,

Pipelining is a particularly effective way of organizing concurrent activity in a computer system. The
basic idea is very simple. It is frequently encountered in manufacturing plants, where pipelining is
commonly known as an assembly-line operation. Readers are undoubtedly familiar with the assembly
line used in car manufacturing. The first station in an assembly line may prepare the chassis of a car, the
next station adds the body, the next one installs the engine, and so on. While one group of workers is
installing the engine on one car, another group is fitting a car body on the chassis of another car, and yet
another group is preparing a new chassis for a third car. It may take days to complete work on a given
car, but it is possible to have a new car rolling off the end of the assembly line every few minutes.

Consider how the idea of pipelining can be used in a computer. The processor executes a program by
fetching and executing instructions, one after the other. Let Fi and Ei refer to the fetch and execute steps
for instruction Ii. Execution of a program consists of a sequence of fetch and execute steps, as shown in
Figure 8.1a.

Now consider a computer that has two separate hardware units, one for fetching instructions and another
for executing them, as shown in Figure 8.1b. The instruction fetched by the fetch unit is deposited in an
intermediate storage buffer, B1. This buffer is needed to enable the execution unit to execute the
instruction while the fetch unit is fetching the next instruction. The results of execution are deposited in
the destination location specified by the instruction. For the purposes of this discussion, we assume that
both the source and the destination of the data operated on by the instructions are inside the block
labelled "Execution unit."

The computer is controlled by a clock whose period is such that the fetch and execute steps of any
instruction can each be completed in one clock cycle. Operation of the computer proceeds as in Figure
8.1c. In the first clock cycle, the fetch unit fetches an instruction I₁ (step F₁) and stores it in buffer B1 at
the end of the clock cycle. In the second clock cycle, the instruction fetch unit proceeds with the fetch
operation for instruction I₂ (step F₂). Meanwhile, the execution unit performs the operation specified by
instruction I₁, which is available to it in buffer B1 (step E₁). By the end of the second clock cycle, the
execution of instruction I₁ is completed and instruction I₂ is available. Instruction I₂ is stored in B1,
replacing I₁, which is no longer needed. Step E₂ is performed by the execution unit during the third clock
cycle, while instruction I₃ is being fetched by the fetch unit. In this manner, both the fetch and execute
units are kept busy all the time. If the pattern in Figure 8.1c can be sustained for a long time, the
completion rate of instruction execution will be twice that achievable by the sequential operation
depicted in Figure 8.1a.
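
The overlap of Figure 8.1c is easy to reproduce programmatically. The sketch below prints a timing chart under the stated assumptions: one clock cycle per stage and no stalls.

# Timing chart for an ideal pipeline: instruction i (1-based) occupies
# stage s in clock cycle i + s - 1.
def timing_chart(stages, n):
    cycles = n + len(stages) - 1
    print("cycle:", *range(1, cycles + 1), sep="\t")
    for i in range(n):
        row = ["-"] * cycles
        for s, label in enumerate(stages):
            row[i + s] = label + str(i + 1)
        print(f"I{i + 1}:", *row, sep="\t")

timing_chart(["F", "E"], 4)    # the two-stage pattern of Figure 8.1c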

In summary, the fetch and execute units in Figure 8.1b constitute a two-stage pipeline in which each
stage performs one step in processing an instruction. An interstage storage buffer, B1, is needed to hold
the information being passed from one stage to the next. New information is loaded into this buffer at
the end of each clock cycle. The processing of an instruction need not be divided into only two steps.
For example, a pipelined processor may process each instruction in four steps, as follows:

F Fetch: read the instruction from the memory.

D Decode: decode the instruction and fetch the source operand(s).

E Execute: perform the operation specified by the instruction.

W Write: store the result in the destination location.

The sequence of events for this case is shown in Figure 8.2a. Four instructions are in progress at any
given time. This means that four distinct hardware units are needed, as shown in Figure 8.2b. These
units must be capable of performing their tasks simultaneously and without interfering with one another.
Information is passed from one unit to the next through a storage buffer. As an instruction progresses
through the pipeline, all the information needed by the stages downstream must be passed along. For
example, during clock cycle 4, the information in the buffers is as follows:

• Buffer B1 holds instruction I3, which was fetched in cycle 3 and is being decoded by the
instruction-decoding unit.
• Buffer B2 holds both the source operands for instruction I2 and the specification of the operation to be
performed. This is the information produced by the decoding hardware in cycle 3.
• Buffer B3 holds the results produced by the execution unit and the destination information for
instruction I1.
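
Reusing the timing_chart function from the sketch above, the four-stage pattern of Figure 8.2a, including the cycle-4 snapshot just described, can be printed directly:

timing_chart(["F", "D", "E", "W"], 5)
# In cycle 4 all four units are busy: I1 is in W, I2 in E,
# I3 in D (its word sits in buffer B1), and I4 is being fetched.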

The speed of execution of programs is influenced by many factors. One way to improve performance is
to use faster circuit technology to implement the processor and the main memory. Another possibility is
to arrange the hardware so that more than one operation can be performed at the same time. In this way,
the number of operations performed per second is increased, even though the time needed to perform
any one operation is not changed.


The idea of pipelining can be extended to more stages. The five-stage processor organization
and the corresponding data path allow instructions to be fetched and executed one at a time. It takes five
clock cycles to complete the execution of each instruction. Rather than wait until each instruction is
completed, instructions can be fetched and executed in a pipelined manner, as shown in Figure 6.1.

The five stages are labelled as Fetch, Decode, Compute, Memory, and Write. Instruction Ij is fetched in
the first cycle and moves through the remaining stages in the following cycles. In the second cycle,
instruction Ij+1 is fetched while instruction Ij is in the Decode stage where its operands are also read
from the register file. In the third cycle, instruction Ij+2 is fetched while instruction Ij+1 is in the
Decode stage and instruction Ij is in the Compute stage where an arithmetic or logic operation is
performed on its operands. Ideally, this overlapping pattern of execution would be possible for all
instructions. Although any one instruction takes five cycles to complete its execution, instructions are
completed at the rate of one per cycle.
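
This completion rate follows from simple arithmetic: with k stages and n instructions, an ideal pipeline needs k + (n − 1) cycles, so the speedup over strictly sequential execution approaches k for large n. A quick check:

def ideal_cycles(k, n):
    # first instruction completes after k cycles, then one per cycle
    return k + (n - 1)

n = 1000
print(ideal_cycles(5, n))            # 1004 cycles, pipelined
print(5 * n)                         # 5000 cycles, sequential
print(5 * n / ideal_cycles(5, n))    # speedup = 4.98, approaching 5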
