Computer Architecture Note 2024
COURSE OUTLINES
A complete history of computing would include a multitude of diverse devices such as the
ancient Chinese abacus, the Jacquard loom (1805) and Charles Babbage's "analytical engine"
(1834). It would also include discussion of mechanical, analog and digital computing
architectures. As late as the 1960s, mechanical devices, such as the Marchant calculator, still
found widespread application in science and engineering. During the early days of electronic
computing devices, there was much discussion about the relative merits of analog vs. digital
computers. In fact, as late as the 1960s, analog computers were routinely used to solve systems
of finite difference equations arising in oil reservoir modeling. In the end, digital computing
devices proved to have the power, economics and scalability necessary to deal with large scale
computations. Digital computers now dominate the computing world in all areas ranging from
the hand calculator to the supercomputer and are pervasive throughout society. Therefore, this
brief sketch of the development of scientific computing is limited to the area of digital, electronic
computers.
The evolution of digital computing is often divided into generations. Each generation is
characterized by dramatic improvements over the previous generation in the technology used to
build computers, the internal organization of computer systems, and programming languages.
Although not usually associated with computer generations, there has been a steady
improvement in algorithms, including algorithms used in computational science.
The concept of stored-program computers appeared in 1945, when John von Neumann drafted the
first version of EDVAC (Electronic Discrete Variable Automatic Computer). Those ideas have
since been the milestones of computers.
Technological Development
The improvements in computer technology have been tremendous since the first machines
appeared. A personal computer that can be bought today for a few thousand dollars has more
performance (in terms of, say, floating-point multiplications per second), more main memory and
more disk capacity than a machine that cost millions of dollars in the 1950s and 1960s.
1. Mainframes: large computers that can support very many users while delivering great
computing power. It is mainly in mainframes that most of the innovations (both in architecture
and in organization) have been made.
2. Minicomputers: have adopted many of the mainframe techniques while being designed to sell
for less, satisfying the computing needs of smaller groups of users. It is the minicomputer group
that has improved at the fastest pace (since 1965, when DEC introduced the first minicomputer, the
PDP-8), mainly due to the evolution of integrated-circuit technology (the first IC appeared in 1958).
3. Supercomputers: designed for scientific applications, they are the most expensive computers;
processing is usually done in batch mode, for reasons of performance.
4. Microcomputers: have appeared in the microprocessor era (the first microprocessor, Intel
4004, was introduced in 1971). The term micro refers only to physical dimensions, not to
computing performance. A typical microcomputer (either a PC or a workstation) nicely fits on a
desk. Microcomputers are a direct product of technological advances: faster CPUs,
semiconductor memories, etc. Over time, many of the concepts previously used in
mainframes and minicomputers have become commonplace in microcomputers. For many years
the evolution of computers was concerned with the problem of object-code compatibility: a new
architecture had to be, at least partly, compatible with older ones, so that older programs ("the dusty
deck") could run without changes on the new machines. A dramatic example is the IBM-PC
architecture: launched in 1981, it proved so successful that further developments had to conform
with the first release, despite the flaws which became apparent within a couple of years.
Due to advances in languages and compiler technology, assembly language is no longer the
language in which new applications are written, although the most performance-sensitive parts
continue to be written in assembly language.
What is the meaning of saying that one computer is faster than another? It depends on your
point of view. If you are a simple user (end user), then you say a computer is faster when it
runs your program in less time, and you think of the time from the moment you launch
your program until you get the results; this is the so-called wall-clock time. On the other hand, if
you are a system manager, then you say a computer is faster when it completes more jobs per
unit of time. As a user you are interested in reducing the response time (also called the execution
time or latency). The computer manager is more interested in increasing the throughput (also
called bandwidth), the number of jobs done in a certain amount of time. Response time,
execution time and throughput are usually associated with tasks and whole computational events.
Latency and bandwidth are mostly used when discussing memory performance.
Consider the following ways of improving performance:
a) a faster CPU: this decreases the response time and also increases the throughput;
b) separate processors for different tasks (as in an airline reservation system or in a credit-card
processing system): several tasks can be processed at the same time, but no single task gets done
faster; hence only the throughput is improved.
In many cases it is not possible to describe the performance, either response-time or throughput,
in terms of constant values but in terms of some statistical distribution. This is especially true for
I/O operations. One can compute the best-case access time for a hard disk as well as the
worst-case access time; what happens in real life is that you issue a disk request whose
completion time (response time) depends not only upon the hardware characteristics of
the disk (best/worst-case access time), but also upon other factors, such as what the disk is doing
at the moment you issue the request for service, and how long the queue of waiting tasks is.
Comparing Performance
Suppose we have to compare two machines A and B. The phrase "A is n% faster than B" means:

    ExecutionTime(B) / ExecutionTime(A) = 1 + n/100

Example 2: suppose the measured execution-time ratio ExecutionTime(B)/ExecutionTime(A) for
some program is 1.167. Then n = (1.167 - 1) x 100 = 16.7. Therefore machine A is 16.7% faster
than machine B; we can also say that the performance of machine A is 16.7% better than the
performance of machine B.
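As a minimal sketch of this comparison (the execution times below are assumed example values, not taken from the text; they were chosen so the ratio comes out to 1.167):

```python
# Sketch: "A is n% faster than B" means time_B / time_A = 1 + n/100.

def percent_faster(time_a: float, time_b: float) -> float:
    """Return n such that machine A is n% faster than machine B."""
    return (time_b / time_a - 1.0) * 100.0

time_a = 6.0  # seconds on machine A (assumed)
time_b = 7.0  # seconds on machine B (assumed)
print(f"A is {percent_faster(time_a, time_b):.1f}% faster than B")  # -> 16.7
```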
CPU Performance
What is the time the CPU of your machine spends running a program? Assuming that
your CPU is driven by a constant-rate clock generator (and this is surely the case), we have:

    CPUtime = Clock_cycles_for_the_program x Tck

where Tck is the clock period. The above formula computes the time the CPU spends running
a program, not the elapsed time: it does not make sense to compute the elapsed time as a
function of Tck, mainly because the elapsed time also includes the I/O time, and the response
time of I/O devices is not a function of Tck. If we know the number of instructions that are
executed from the time the program starts until the very end, let's call this the Instruction
Count (IC), then we can compute the average number of clock cycles per instruction (CPI)
as follows:

    CPI = Clock_cycles_for_the_program / IC

so that the CPU time can be rewritten as:

    CPUtime = IC x CPI x Tck
Unfortunately the above parameters are not independent of each other so that changing one of
them usually affects the others.
Whenever a designer considers an improvement to the machine (i.e., a change intended to lower
the CPUtime), he or she must thoroughly check how the change affects other parts of the system.
For example, a change in organization may lower CPI but increase Tck, thus offsetting the
improvement in CPI. A final remark: CPI has to be measured, not simply calculated from the
system's specification, because CPI strongly depends on the memory hierarchy organization: a
program running on a system without a cache will certainly have a larger CPI than the same
program running on the same machine but with a cache.
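A minimal sketch of the CPUtime = IC x CPI x Tck model (all numbers below are assumed, illustrative values) shows how a lower CPI can survive, or be offset by, a longer clock period:

```python
# CPU-time model: CPUtime = IC * CPI * Tck.

def cpu_time(ic: int, cpi: float, tck: float) -> float:
    """Instruction count * average cycles per instruction * clock period (s)."""
    return ic * cpi * tck

base     = cpu_time(ic=10**9, cpi=2.0, tck=1.00e-9)  # 1 GHz clock (assumed)
improved = cpu_time(ic=10**9, cpi=1.5, tck=1.25e-9)  # lower CPI, slower clock
print(base, improved)  # 2.0 s vs 1.875 s: here the CPI gain survives the slower clock
```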
Designing a computer is a challenging task. It involves software (at least at the level of designing
the instruction set), and hardware as well at all levels: functional organization, logic design,
implementation. Implementation itself deals with designing/specifying ICs, packaging, noise,
power, cooling, etc. It would be a terrible mistake to disregard one aspect or another of computer
design; rather, the computer designer has to design an optimal machine across all the mentioned
levels. You cannot find an optimum unless you are familiar with a wide range of technologies,
from compiler and operating-system design to logic design and packaging.
2. UNDERSTAND AND ANALYZE COMPUTER SYSTEMS ARCHITECTURE.
Main memory: stores data. - I/O: moves data between the computer and its external environment. -
System interconnections: some mechanism that provides for communication among CPU, main memory
and I/O. A common example of system interconnection is a system bus, consisting of a
number of conducting wires to which all the other components attach. However, the most interesting
and complex component is the CPU. Its major structural components are as follows: - Control unit:
controls the operations of the CPU and hence the computer. - Arithmetic and logic unit (ALU): performs
the computer's data-processing functions. - Registers: hold variables or intermediate results of
computation, as well as special-purpose values; they provide storage internal to the CPU. - CPU
interconnection: some mechanism that provides for communication among the control unit, ALU and
registers.
Different functional units in a computer system and their operations:
Processor
The processor, also called the central processing unit (CPU), interprets and carries out the basic
instructions that operate a computer. It controls the operations of the computer and performs its
data-processing functions.
The processor significantly impacts overall computing power and manages most of a computer's
operations. On the larger computers, such as mainframes and supercomputers, the various
functions performed by the processor extend over many separate chips and often multiple circuit
boards. On a personal computer, all functions of the processor usually are on a single chip. Some
computer and chip manufacturers use the term microprocessor to refer to a personal computer
processor chip.
The underlying principles of all computer processors are the same. It does not matter what the
brand, age, software or set-up is: fundamentally, they all take signals in the form of binary digits
(0s and 1s), manipulate them according to a set of instructions, and produce output in the form of
binary. The voltage on a line at a given time determines whether the signal is a 0 or a 1. On a
3.3-volt system, an application of 3.3 volts means a 1, while an application of 0 volts means a 0.
Processors work by reacting to an input of 0s and 1s in specific ways and then returning an output
based on the decision. The decision itself happens in a circuit called a logic gate, each of which
requires at least one transistor, with the inputs and outputs arranged differently for different
operations. The fact that today's processors contain millions of transistors offers a clue as to how
complex the logic system is. The processor's logic gates work together to make decisions using
Boolean logic, which is based on the algebraic system established by the mathematician George Boole.
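To make the gate idea concrete, here is a small illustrative sketch (not from the text) that builds the usual Boolean operations out of a single NAND primitive, the way gates are composed from transistors:

```python
# Build AND, OR and XOR from a single NAND primitive; inputs/outputs are 0s and 1s.

def nand(a: int, b: int) -> int:
    return 0 if (a and b) else 1

def not_(a: int) -> int:
    return nand(a, a)

def and_(a: int, b: int) -> int:
    return not_(nand(a, b))

def or_(a: int, b: int) -> int:
    return nand(not_(a), not_(b))

def xor(a: int, b: int) -> int:
    return and_(or_(a, b), nand(a, b))

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", and_(a, b), "OR:", or_(a, b), "XOR:", xor(a, b))
```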
The CPU is made up of two main parts: the Arithmetic Logic Unit (ALU) and the Control Unit (CU).
The Control Unit (CU)
The CU controls the operations of the CPU and hence the computer. The control unit is the
component of the processor that directs and coordinates most of the operations in the computer.
It interprets each instruction issued by a program and then initiates the appropriate action to carry
out the instruction; i.e., the control unit coordinates and manages CPU activities, in particular the
execution of instructions by the arithmetic and logic unit (ALU). Types of internal components
that the control unit directs include the arithmetic/logic unit, registers, and buses.
The functions performed by the control unit vary greatly with the internal architecture of the CPU,
since the control unit effectively implements this architecture. On a regular processor that
executes x86 instructions natively, the control unit performs the tasks of fetching, decoding,
managing execution and then storing results.
The Arithmetic Logic Unit (ALU)
The ALU performs the computer's arithmetic calculations. It also carries out logic operations, like
comparisons of data, which may result in different actions. For example, to determine if an
employee should receive overtime pay, software instructs the ALU to compare the number of
hours an employee worked during the week with the regular-time hours allowed (e.g., 40 hours).
If the hours worked are greater than 40, software instructs the ALU to perform calculations that
compute the overtime wage.
Registers
A processor contains small, high-speed storage locations, called registers, that temporarily hold
data and instructions. Registers are part of the processor, not part of memory or a permanent
storage device. Processors have many different types of registers, each with a specific storage
function. Register functions include storing the location from which an instruction was fetched,
storing an instruction while the control unit decodes it, storing data while the ALU computes with
it, storing the results of a calculation, signaling the result of a logic operation, and indicating the
status of the program or the CPU itself. Some registers may be accessible to programmers, while
others are reserved for use by the CPU itself. Registers store binary values such as 1 or 0 as
electrical voltages of, say, 5 volts (or less) or 0 volts. In summary, registers are locations where
data or control information is temporarily stored; a register is like a drawer in which you keep
your files and papers.
Cache
Most of today's computers improve processing times with cache (pronounced "cash"). Two types of
cache are memory cache and disk cache. Memory cache helps speed the processes of the computer
because it stores frequently used instructions and data. Most personal computers today have two
types of memory cache: L1 cache and L2 cache. Some also have L3 cache.
L1 Cache – L1 cache is built directly into the processor chip. L1 cache usually has a very
small capacity, ranging from 8 KB to 128 KB. The more common sizes for personal
computers are 32 KB or 64 KB.
L2 Cache – L2 cache is slightly slower than L1 cache but has a much larger capacity,
ranging from 64 KB to 16 MB. When discussing cache, most users are referring to L2 cache.
Current processors include advanced transfer cache (ATC), a type of L2 cache built directly
on the processor chip. Processors that use ATC perform at much faster rates than
those that do not use it. Personal computers today typically have from 512 KB to 8 MB of
advanced transfer cache. Servers and workstations have from 8 MB to 16 MB of advanced
transfer cache.
L3 Cache – L3 cache is a cache on the motherboard that is separate from the processor chip. L3
cache exists only on computers that use L2 advanced transfer cache (ATC). Personal
computers often have from 8 MB to 24 MB of L3 cache.
Cache speeds up processing time because it stores frequently used instructions and data.
When the processor needs an instruction or data, it searches memory in this order: L1
cache, then L2 cache, then L3 cache (if it exists), then RAM – with a greater delay in
processing for each level of memory it must search.
If the instructions or data are not found in memory, then the computer must search a slower
storage medium such as a hard disk, CD, or DVD. Windows Vista users can increase the size
of cache through Windows ReadyBoost, which can allocate up to 4 GB of removable flash
memory, including USB flash drives, CompactFlash cards, and SD (Secure Digital) cards.
Removable flash memory is discussed in more depth later in this chapter and the book.
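The L1, then L2, then L3, then RAM search order described above can be sketched as follows (the latencies and memory contents are assumed, illustrative values):

```python
# Search memory in the order L1, L2, L3, RAM, accumulating delay at each level.

LEVELS = [("L1", 1), ("L2", 4), ("L3", 12), ("RAM", 100)]  # (name, assumed cycles)

def lookup(address: int, contents: dict) -> tuple:
    """Return (level the address was found at, total cycles spent searching)."""
    cycles = 0
    for name, latency in LEVELS:
        cycles += latency
        if address in contents[name]:
            return name, cycles
    raise LookupError("not in memory: must go to hard disk, CD, or DVD")

contents = {"L1": {0x10}, "L2": {0x10, 0x20}, "L3": set(), "RAM": {0x30}}
print(lookup(0x10, contents))  # ('L1', 1)
print(lookup(0x30, contents))  # ('RAM', 117): every faster level searched first
```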
[Figure 2: Typical Microprocessor Architecture – the address, data and control buses (read/write,
interrupt, clock and reset lines) connect to a bus interface unit; an internal bus links the instruction
register, decode unit, control unit, program counter and stack pointer; the general-purpose registers
include AX (the accumulator), BX, DX, BP and SI, plus the flags register.]
Buses
A bus is used to transfer information between several different modules. Small and mid-range
computer systems, such as the Macintosh, have a single bus connecting all major components.
Supercomputers and other high-performance machines have more complex interconnections, but
many components will have internal buses.
Communication on a bus is broken into discrete transactions. Each transaction has a sender and a
receiver. In order to initiate a transaction, a module has to gain control of the bus and become
(temporarily, at least) the bus master. Often several devices have the ability to become the master;
for example, the processor controls transactions that transfer instructions and data between
memory and CPU, but a disk controller becomes the bus master to transfer blocks of data between
disk and memory. When two or more devices want to transfer information at the same time, an
arbitration protocol is used to decide which will be given control first. A protocol is a set of signals
exchanged between devices in order to perform some task, in this case to agree which device will
become the bus master.
Once a device has control of the bus, it uses a communication protocol to transfer the information.
In an asynchronous (unclocked) protocol the transfer can begin at any time, but there is some
overhead involved in notifying potential receivers that information needs to be transferred. In a
synchronous protocol, transfers are controlled by a global clock and begin only at well-known
times.
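A toy sketch of bus arbitration (a fixed-priority scheme; the device names and priorities are invented for illustration, and real arbitration protocols vary) might look like this:

```python
# Fixed-priority arbitration: of all devices requesting the bus at the same
# time, the one with the smallest priority number becomes bus master.

PRIORITY = {"dram_refresh": 1, "disk_controller": 2, "cpu": 3}  # 1 = highest

def arbitrate(requests: list) -> str:
    """Grant the bus to the highest-priority requesting device."""
    return min(requests, key=lambda device: PRIORITY[device])

print(arbitrate(["cpu", "disk_controller"]))  # disk_controller becomes master
```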
Instruction register
When the Bus Interface Unit receives an instruction, it transfers it to the Instruction Register for
temporary storage. In Pentium processors, the Bus Interface Unit transfers instructions to the L1
I-cache; there is no instruction register as such.
Stack Pointer
A 'stack' is a small area of reserved memory used to store the data in the CPU's registers when:
(1) system calls are made by a process to operating-system routines;
(2) hardware interrupts are generated by input/output (I/O) transactions on peripheral devices;
(3) a process-rescheduling event occurs on foot of a hardware timer interrupt.
This transfer of register contents is called a 'context switch'.
The stack pointer is the register which holds the address of the most recent 'stack' entry. Hence,
when a system call is made by a process (to, say, print a document) and its context is stored on the
stack, the called system routine uses the stack pointer to reload the register contents when it has
finished printing. Thus the process can continue where it left off.
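A minimal sketch of this save/restore mechanism (the register names and values below are assumed for illustration):

```python
# On a system call or interrupt the register contents are pushed onto the
# stack; when the routine finishes they are popped back (a context switch).

stack = []                                # the reserved 'stack' area
registers = {"AX": 7, "BX": 3, "PC": 300}

def context_save():
    stack.append(registers.copy())        # push the current register contents

def context_restore():
    registers.update(stack.pop())         # reload registers from the top entry

context_save()
registers.update(AX=0, BX=0, PC=9000)     # the routine clobbers the registers
context_restore()
print(registers)  # {'AX': 7, 'BX': 3, 'PC': 300} -- the process continues
```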
Instruction Decoder
The Instruction Decoder is an arrangement of logic elements which act on the bits that constitute
the instruction. Simple instructions, with corresponding logic hard-wired into the execution unit, are
simply passed to the Execution Unit (and/or the MMX unit in the Pentium II, III and IV); complex
instructions are decoded so that the related microcode modules can be transferred from the CPU's
microcode ROM to the execution unit. The Instruction Decoder will also store referenced operands
where appropriate, so that data at the referenced memory locations can be fetched.
The phrase Von Neumann architecture derives from a paper written by computer scientist John
Von Neumann in 1945. This describes a design architecture for an electronic digital computer with
subdivisions of a central arithmetic part, a memory to store both data and instructions, external
storage, and input and output mechanisms. The meaning of the phrase has evolved to mean a
stored-program computer. A stored-program digital computer is one that keeps its programmed
instructions, as well as its data, in read-write, random-access memory (RAM). So John Von
Neumann introduced the idea of the stored program. Previously data and programs were stored in
separate memories. Von Neumann realised that data and programs are indistinguishable and can,
therefore, use the same memory. On a large scale, the ability to treat instructions as data is what
makes assemblers, compilers and other automated programming tools possible: one can write
programs that write programs. This led to the introduction of compilers, which accept high-level-
language source code as input and produce binary code as output.
The von Neumann architecture uses a single processor which follows a linear sequence of fetch-
decode-execute. In order to do this, the processor has to use some special registers, which are
discrete memory locations with special purposes attached. These are
Register   Meaning
PC         Program Counter
CIR        Current Instruction Register
MAR        Memory Address Register
MDR        Memory Data Register
IR         Index Register
The program counter keeps track of where to find the next instruction so that a copy of the
instruction can be placed in the current instruction register. Sometimes the program counter is
called the sequence control Register (SCR) as it controls the sequence in which instructions are
executed.
The current instruction register holds the instruction that is to be executed.
The memory address register is used to hold the memory address of either the next piece
of data or the next instruction that is to be used.
The memory data register acts like a buffer, holding anything that is copied from memory
ready for the processor to use.
The central processor contains the arithmetic-logic unit (also known as the arithmetic unit) and the
control unit. The arithmetic-logic unit (ALU) is where data is processed. This involves arithmetic
and logical operations. Arithmetic operations are those that add and subtract numbers, and so on.
Logical operations involve comparing binary patterns and making decisions.
The control unit fetches instructions from memory, decodes them and synchronises the operations
before sending signals to other parts of the computer. The accumulator is in the arithmetic unit; the
program counter and the instruction register are in the control unit; and the memory data register
and memory address register are in the processor.
An index register is a microprocessor register used for modifying operand addresses during the
run of a program, typically for doing vector/array operations. Index registers are used for a special
kind of indirect addressing (covered in 3.5 (i)), where an immediate constant (i.e., one which is part
of the instruction itself) is added to the contents of the index register to form the address of the
actual operand or data.
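For illustration, indexed addressing can be sketched like this (the addresses and values are assumed): the effective address is the immediate constant carried in the instruction plus the contents of the index register.

```python
# Step through an array by adding an index register to a base constant.

memory = {1000 + i: value for i, value in enumerate([10, 20, 30, 40])}
base_constant = 1000          # the immediate constant, part of the instruction
total = 0
for index_register in range(4):
    total += memory[base_constant + index_register]  # effective address
print(total)  # 100
```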
In conclusion, virtually all contemporary computer designs are based on concepts developed by
John von Neumann at the Institute for Advanced Studies, Princeton. Such a design is referred to
as the von Neumann architecture and is based on three key concepts:
• Data and instructions are stored in a single read-write memory.
• The contents of this memory are addressable by location, without regard to the type of data
contained there.
• Execution occurs in a sequential fashion (unless explicitly modified) from one instruction to
the next.
In addition, there is a small set of basic logic components that can be combined in various ways to
store binary data and to perform arithmetic and logical operations on that data.
A von Neumann machine has only a single path between the main memory and the control unit
(CU). This feature/constraint is referred to as the von Neumann bottleneck.
The first problem is that every piece of data and every instruction has to pass across the data bus in
order to move from main memory into the CPU (and back again). This is a problem because the
data bus is a lot slower than the rate at which the CPU can carry out instructions; this is the
'von Neumann bottleneck'. If nothing were done, the CPU would spend most of its time waiting
around for instructions. A special kind of memory called a cache (pronounced 'cash') is used to
tackle this problem: the system tries to fetch blocks of memory into the cache ahead of time, so
that further required instructions or data are available before they are needed.
The second problem is that data and programs share the same memory space.
This is a problem because it is quite easy for a poorly written or faulty piece of code to write data
into an area holding other instructions, so trashing that program.
Another problem is that the rate at which data needs to be fetched and the rate at which
instructions need to be fetched are often very different, and yet they share the same bottlenecked
data bus. To solve this problem, the Harvard architecture splits the memory into two parts, one
for data and another for programs, with each part accessed by a different bus. This means the CPU
can be fetching both data and instructions at the same time. There is also less chance of program
corruption. This architecture is sometimes used within the CPU to handle its caches, but it is less
used with RAM because of complexity and cost.
The von Neumann architecture is a sequential processing machine, but what if we could process
more than one piece of data at the same time? That would dramatically speed up the rate at which
processing could occur. This is the idea behind 'parallel processing', the simultaneous processing
of data. There are a number of ways to carry out parallel processing; the points below show their
advantages and disadvantages and how they are applied in real life.
Advantages
Faster when handling large amounts of data, with each data set requiring the same
processing (array and multi-core methods).
Not limited by the bus transfer rate (the von Neumann bottleneck).
Can make maximum use of the CPU (pipeline method) in spite of the bottleneck.
Disadvantages
Only certain types of data are suitable for parallel processing: data that relies on the result
of a previous operation cannot be made parallel. For parallel processing, each data set must be
independent of the others.
More costly in terms of hardware: multiple processing blocks are needed, and this applies to all
three methods.
Pipelining
Using the von Neumann architecture for a microprocessor illustrates that an instruction can
basically be in one of three phases: being fetched (from memory), being decoded (by the control
unit) or being executed (by the control unit). An alternative is to split the processor up into three
parts, each of which handles one of the three stages, so that instructions stream through a pipeline.
[Fig 3.3.d.1: a three-stage pipeline – while Instruction 1 is being decoded, Instruction 2 is
already being fetched.]
This helps with the speed of throughput unless the next instruction in the pipe is not the next one
that is needed. Suppose instruction 2 is a jump to instruction 10. Then instructions 3, 4 and 5 need
to be removed from the pipe and instruction 10 needs to be loaded into the fetch part of the pipe.
Thus, the pipe will have to be cleared and the cycle restarted in this case. The result is shown in
Fig.3.3.d.2 below.
[Fig.3.3.d.2: the pipeline around a jump –
Fetch: Instruction 1  | Decode: –             | Execute: –
Fetch: Instruction 2  | Decode: Instruction 1 | Execute: –
(jump decoded; pipe cleared)
Fetch: Instruction 10 | Decode: –              | Execute: –
Fetch: Instruction 11 | Decode: Instruction 10 | Execute: – ]
The effect of pipelining is that there are three instructions being dealt with at the same time.
This SHOULD reduce the execution times considerably (to approximately 1/3 of the standard
times); however, this would only be true for a very linear program. Once jump instructions are
introduced, the problem arises that the wrong instructions are in the pipe, so the pipeline has to be
cleared and the process started again.
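The 1/3 claim and the cost of jumps can be checked with a small timing sketch (the cycle accounting is simplified and the instruction counts are assumed):

```python
# Three-stage pipe: unpipelined, each instruction takes 3 cycles; pipelined,
# one instruction completes per cycle once the pipe is full, but each taken
# jump flushes the pipe and costs (STAGES - 1) extra cycles.

STAGES = 3

def cycles(n_instructions: int, n_jumps_taken: int = 0) -> tuple:
    unpipelined = STAGES * n_instructions
    pipelined = (STAGES - 1) + n_instructions + n_jumps_taken * (STAGES - 1)
    return unpipelined, pipelined

print(cycles(100))      # (300, 102): close to the 1/3 figure for linear code
print(cycles(100, 20))  # (300, 142): jumps erode the speedup
```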
INSTRUCTION PIPELINING
Pipelining lays the production process out in an assembly line, so that the products at various
stages can be worked on simultaneously. Consider a pipeline with two independent stages. The
first stage fetches an instruction and buffers it; while the second stage is executing that
instruction, the first stage takes advantage of any unused memory cycles to fetch and buffer the
next instruction. This is called instruction prefetch, or fetch overlap. This process helps to speed
up instruction execution.
TWO-STAGE INSTRUCTION PIPELINE
[Figure: two-stage instruction pipeline – (a) simplified view: instructions flow from the fetch
stage to the execute stage; (b) expanded view: the fetch stage waits while the execute stage is
busy, a branch supplies a new address to the fetch stage, and prefetched instructions following a
taken branch are discarded.]
One of the major problems in designing an instruction pipeline is assuring a steady flow of
instructions to the initial stages of the pipeline; the primary impediment is the conditional branch
instruction.
1. Multiple streams: a simple pipeline suffers a penalty for a branch instruction because it
must choose one of two instructions to fetch next and may make the wrong choice. A
solution is to provide multiple pipeline streams, fetching both instructions. This leads to
contention delays, but it is a strategy that can improve performance. Examples of machines
with two or more pipeline streams are the IBM 370/168 and the IBM 3033.
2. Prefetch branch target: the target of the branch is prefetched, in addition to the instruction
following the branch. This target is then saved until the branch instruction is executed. The
IBM 360/91 uses this approach.
3. Loop buffer: a small, very high-speed buffer/memory that contains the most recently
fetched instructions, in sequence. If a branch is to be taken, it is first checked whether the
branch target is within the buffer; if so, the next instruction is fetched from the buffer.
Machines using loop buffers include the CDC Star-100, CDC 6600 and CDC 7600, and the
CRAY-1.
4. Delayed branch: improves pipeline performance by automatically rearranging the
instructions within a program, so that branch instructions occur later than actually desired.
A numerical sketch of how much taken branches cost a pipeline follows below.
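As a rough, assumed-numbers sketch of why these strategies matter: the average stall added per instruction by taken branches is branch frequency x fraction taken x flush penalty, reduced by whatever fraction of the penalty a given strategy manages to hide (the 75% figure below is purely illustrative).

```python
def branch_overhead(branch_freq: float, taken_frac: float,
                    penalty_cycles: float, hidden_frac: float = 0.0) -> float:
    """Average stall cycles added per instruction by taken branches."""
    return branch_freq * taken_frac * penalty_cycles * (1.0 - hidden_frac)

print(branch_overhead(0.20, 0.60, 2))        # no strategy: 0.24 cycles/instr
print(branch_overhead(0.20, 0.60, 2, 0.75))  # a strategy hiding 75%: 0.06
```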
A concrete example of a pipelined design is the Intel 80486, which uses a five-stage pipeline:
1. FETCH: instructions are fetched from the cache or external memory and placed into one of
the two 16-byte prefetch buffers.
2. DECODE STAGE 1: all opcode and addressing-mode information is decoded.
3. DECODE STAGE 2: this stage controls the computation of the more complex addressing
modes.
4. EXECUTE: this stage includes ALU operations, cache access, and register update.
5. WRITE BACK: this stage, if needed, updates registers and status flags modified during the
preceding execute stage. If the current instruction updates memory, the computed value is
sent to the cache and to the bus-interface write buffers at the same time.
An array processor (or vector processor) has a number of Arithmetic Logic Units (ALUs), which
allows all the elements of an array to be processed at the same time.
Fig.3.3.d.3 below illustrates the architecture of an array or vector processor.
[Fig.3.3.d.3: an array processor – a single control unit issues one instruction to several ALUs,
each of which operates on its own data.]
With an array processor, a single instruction is issued by a control unit, and that instruction is
applied to a number of data sets at the same time.
Limitations
This architecture relies on the fact that all the data sets are acted on by a single instruction.
However, if the data sets somehow rely on each other, then you cannot apply parallel processing;
for example, if data A has to be processed before data B, then you cannot do A and B
simultaneously. This dependency is what makes parallel processing difficult to implement, and it
is why sequential machines are still extremely common.
Multiple processors:
Moving on from an array processor, where a single instruction acts upon multiple data sets, the
next level of parallel processing is to have multiple instructions acting upon multiple data sets.
This is achieved by having a number of CPUs applied to a single problem, with each CPU
carrying out only part of the overall problem.
The advantage of having a co-processor is that calculation (and hence performance) is much
faster. The disadvantages are that it is more expensive, requires more motherboard space and
takes more power. But if the computer is dedicated to handling heavy floating-point work, then it
may be worth it. For instance, a processing card in a communication system may include a maths
co-processor to process the incoming data as quickly as possible.
INSTRUCTION CYCLE: Obviously there are at least two steps in the cycle of an instruction:
fetch (i.e., the instruction is brought into the CPU, more precisely into the IR) and execute. An
instruction cycle consists of the activities required to fetch and execute an instruction.
Instruction Cycle: The processing required for a single instruction is called an instruction cycle. It
is made up of two stages:
i. The Fetch Cycle- the processor reads/fetches instructions from memory, one at a time.
ii. The Execute Cycle – the processor executes each instruction.
Program execution stops only if the machine is turned off, some sort of unrecoverable error occurs
or a program instruction that stops the computer is encountered.
Instruction Fetch and Execute: At the beginning of each instruction cycle, the processor
fetches an instruction from memory. In a typical processor, a register called the program counter
(PC) holds the address of the instruction to be fetched next. Unless told otherwise, the processor
always increments the PC after each instruction fetch so that it will fetch the next instruction in
sequence (i.e., the instruction located at the next higher memory address). So, for example,
consider a computer in which each instruction occupies one 16-bit word of memory. Assume that
the program counter is set to location 300. The processor will next fetch the instruction at
location 300. On succeeding instruction cycles, it will fetch instructions from locations 301, 302,
303, and so on. This sequence may be altered, as explained presently. The fetched instruction is
loaded into a register in the processor known as the instruction register (IR). The instruction
contains bits that specify the action the processor is to take. The processor interprets the
instruction and performs the required action.
The action falls into four categories:
i) Processor-memory: data may be transferred from processor to memory or from memory to
processor.
ii) Processor-I/O: data may be transferred between the processor and an I/O module.
iii) Data processing: the processor may perform some arithmetic or logic operation on data.
iv) Control: an instruction may specify that the sequence of execution be altered. For example,
the processor may fetch an instruction from location 149 which specifies that the next instruction
be taken from location 182. A toy fetch-execute loop illustrating these categories follows below.
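In the sketch below, the three-instruction "ISA" is invented purely for illustration; the PC starts at 300, as in the earlier example, and a control instruction redirects it to location 182:

```python
memory = {300: ("LOAD", 5), 301: ("ADD", 2), 302: ("JUMP", 182),
          182: ("ADD", 10), 183: ("HALT", 0)}
pc, acc = 300, 0

while True:
    opcode, operand = memory[pc]   # fetch into the instruction register
    pc += 1                        # increment the PC after each fetch
    if opcode == "LOAD":
        acc = operand              # data processing
    elif opcode == "ADD":
        acc += operand             # data processing
    elif opcode == "JUMP":
        pc = operand               # control: alter the sequence
    elif opcode == "HALT":
        break

print(acc)  # 17: 5 + 2, then + 10 after the jump to location 182
```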
INTERRUPTS
An interrupt is an event, external to the currently executing process, that causes a change in the
normal flow of instruction execution; it is generated by hardware devices external to the CPU.
With interrupt-driven I/O, the I/O interface monitors the I/O device instead of the CPU doing so.
When the interface finds that the I/O device is ready for data transfer, it generates an interrupt
request to the CPU. Upon detecting an interrupt, the CPU momentarily stops the task it is doing,
services the interrupt, and then returns to the task it was performing. Thus, if an instruction
involves an I/O operation, the processor need not be suspended waiting for the completion of the
I/O transfer; it can execute some other program in the meantime.
Other devices connected to the PC include the keyboard, mouse, screen, disk drive, scanner,
printer, sound card, camera, etc. Each of these devices has an interrupt line that it can use to
signal the processor; the processor then executes a routine called an interrupt handler to deal with
the interrupt. Interrupts are assigned priorities so that simultaneous interrupts can be handled.
Interrupts can be generated by hardware or by software:
Hardware interrupts are used by devices to communicate that they require attention from the OS.
They are implemented using electronic alerting signals that are sent to the processor from an
external device; e.g., pressing a key on the keyboard or moving the mouse triggers hardware
interrupts that make the processor read the keystroke or the mouse position.
A software interrupt is caused by an exceptional condition in the processor itself, or by a special
instruction in the instruction set. For example, a computer uses software interrupt instructions to
communicate with the disk controller to request data to be read from, or written to, the disk.
Reduced Instruction Set Computer (RISC) Architecture versus Complex
Instruction Set Computer (CISC) Architecture
Instruction Set Architecture: "Instruction set architecture is the structure of a computer that a
machine-language programmer (or a compiler) must understand to write a correct (timing-
independent) program for that machine." The ISA defines, among other things, the instruction
set, the data types, the registers and the addressing modes that are visible to the programmer.
Two principal reasons have motivated the trend towards CISC: compiler simplification and the
expectation of better performance.
The first of the reasons cited, compiler simplification, seems obvious. The task of the compiler
writer is to generate a sequence of machine instructions; this task is simplified if machine
instructions implement high-level constructs directly. However, complex machine instructions
are often hard to exploit, because the compiler must find those cases that exactly fit the construct.
The other major reason cited is the expectation that a CISC will perform better: that programs
will be smaller and that they will execute faster. There are two advantages to smaller programs.
First, the program takes up less memory. Also, smaller programs should improve performance,
and this will happen in two ways: first, fewer instructions means fewer instruction bytes to be
fetched; second, in a paging environment, smaller programs occupy fewer pages, reducing page
faults. On the other hand, because there are more instructions on a CISC, longer opcodes are
required, producing longer instructions.
The first characteristic listed is that there is one machine instruction per machine cycle. A
machine cycle is defined to be the time it takes to fetch two operands from registers, perform an
ALU operation, and store the result in a register. Thus, RISC machine instructions should be no
more complicated than, and execute about as fast as, microinstructions on CISC machines.
With simple, one-cycle instructions, there is little or no need for microcode; the machine
instructions can be hardwired. Such instructions should execute faster than comparable machine
instructions on other machines, because it is not necessary to access a microprogram control
store during instruction execution.
A second characteristic is that most operations should be register-to-register, with only simple
LOAD and STORE operations accessing memory. This design feature simplifies the instruction
set and therefore the control unit. For example, a RISC instruction set may include only one or
two ADD instructions (e.g., integer add, add with carry); the VAX has 25 different ADD
instructions. Another benefit is that such an architecture encourages the optimization of register
use, so that frequently accessed operands remain in high-speed storage. This emphasis on
register-to-register operations is notable for RISC design; contemporary CISC machines provide
such instructions but also include memory-to-memory and mixed register/memory operations.
Attempts to compare these approaches were made in the 1970s, before the appearance of RISCs:
hypothetical architectures were evaluated on program size and on the number of bits of memory
traffic. Results such as these led some researchers to suggest that future architectures should
contain few or no registers [MYER78]. One wonders what they would have thought, at the time,
of the RISC machine once produced by Pyramid, which contained no fewer than 528 registers.
A third characteristic is the use of simple addressing modes: almost all RISC instructions use
simple register addressing.
4. No indirect addressing, which requires one memory access to get the address of another
operand in memory.
5. No operations that combine load/store with arithmetic (e.g., add from memory, add to memory).
8. Maximum number of uses of the memory management unit (MMU) for a data address in an
instruction: one.
9. Number of bits for the integer register specifier equal to five or more. This means that
at least 32 integer registers can be explicitly referenced at a time.
10. Number of bits for the floating-point register specifier equal to four or more. This means that
at least 16 floating-point registers can be explicitly referenced at a time.
CISC                                               RISC
Complex, powerful instructions                     Simple hard-wired machine code and control unit
Numerous memory-addressing options for operands    Compiler and IC developed simultaneously
DESIGN OF CONTROL UNIT
The basic functional elements of the processor are: the ALU; the registers (which store data
internal to the processor); the internal data path (used to move data between registers, and
between a register and the ALU); the external data path (which links registers to memory and I/O
modules); and the control unit, which causes operations to happen within the processor. The
control unit performs two basic tasks:
a. Sequencing: the control unit causes the processor to step through a series of micro-
operations in the proper sequence, based on the program being executed.
b. Execution: the control unit causes each micro-operation to be performed.
[Figure: block diagram of the control unit – its inputs are the instruction register, the ALU flags,
the clock, and control signals from the control bus; its outputs are control signals within the CPU
and control signals to the control bus.]
MICROINSTRUCTIONS
To implement a control unit with many basic logic elements, the design must include logic for
sequencing through micro-operations, for executing micro-operations, for interpreting opcodes
and for making decisions based on ALU flags.
Implementation of a microprogrammed control unit involves the use of a symbolic notation
known as a microprogramming language. Each line describes a set of micro-operations occurring
at one time and is known as a microinstruction. A sequence of microinstructions is known as a
microprogram, or firmware. A microprogram is midway between hardware and software: it is
easier to design in firmware than in hardware, but it is more difficult to write a firmware program
than a software program.
Advantages of Microprogramming
- It simplifies the design of the control unit
- It is cheaper
- It is less error-prone
Disadvantages of Microprogramming
- It is slower than a hardwired unit
NB: CISC machines use microprogramming, while RISC machines use a hardwired control unit.
The control unit contains a microprogram that describes its behavior. It has the following
basic elements:
a. Control address register
b. Control memory
c. Control buffer register
The control memory holds the set of microinstructions.
The control address register contains the address of the next microinstruction to be read.
The control buffer register holds the microinstruction after it is read from the control memory.
A microprogrammed control unit is an alternative to a hardwired control unit; it is implemented
by a microprogram, which is a sequence of instructions in a microprogramming language.
Micro-instruction sequencing
These are the two basic tasks performed by a microprogrammed control unit:
i. Microinstruction sequencing: get the next microinstruction from the control memory.
ii. Microinstruction execution: generate the control signals needed to execute the micro-
instruction. A minimal sketch of these two tasks appears below.
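In this sketch the control-memory contents and signal names are assumed (a real microprogram is far richer); the point is only the division of labour between sequencing and execution.

```python
# The control address register (CAR) selects the next microinstruction from
# control memory; the control buffer register (CBR) holds it while its
# control signals are emitted (here, simply printed).

CONTROL_MEMORY = [
    {"signals": ["MAR <- PC"],                 "next": 1},
    {"signals": ["MDR <- mem[MAR]"],           "next": 2},
    {"signals": ["IR <- MDR", "PC <- PC + 1"], "next": None},  # end of routine
]

car = 0                                # control address register
while car is not None:
    cbr = CONTROL_MEMORY[car]          # sequencing: read the microinstruction
    print("assert:", *cbr["signals"])  # execution: generate control signals
    car = cbr["next"]
```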
INSTRUCTION REPRESENTATION
Within the computer, each instruction is represented by a sequence of bits. The instruction is
divided into fields, corresponding to the constituent elements of the instruction.
Opcodes are represented by mnemonics that indicate the operation to be performed on the
operand, e.g.
ADD add
SUB subtract
MUL multiply
DIV divide
An example instruction is:
ADD R, Y (1)
where Y is the address of a location in memory. (1) above means: add the value contained in data
location Y to the contents of register R.
Computer memory is organized into a hierarchy. At the highest level (closest to the processor)
are the processor registers. Next come one or more levels of cache; when multiple levels
are used, they are denoted L1, L2, and so on. As one goes down the memory hierarchy, one finds
decreasing cost per bit, increasing capacity, and slower access time.
The term location refers to whether memory is internal or external to the computer.
Internal memory is often equated with main memory, but there are other forms of internal
memory. The processor requires its own local memory, in the form of registers; the control unit
portion of the processor may also require its own internal memory. Cache is another form of
internal memory. External memory consists of peripheral storage devices, such as disk and tape,
that are accessible to the processor via I/O controllers.
An obvious characteristic of memory is its capacity. For internal memory, this is typically
expressed in terms of bytes (1 byte = 8 bits) or words. Common word lengths are 8, 16, and 32
bits.
Word: The "natural" unit of organization of memory. The size of the word is typically equal to
the number of bits used to represent an integer and to the instruction length.
Unit of transfer: For main memory, this is the number of bits read out of or written into memory
at a time. The unit of transfer need not equal a word or an addressable unit. For external memory,
data are often transferred in much larger units than a word, and these are referred to as blocks.
Another distinction among memory types is the method of accessing units of data. These
include the following:
Sequential access: memory is organized into units of data, called records. Access must be made
in a specific linear sequence. Stored addressing information is used to separate records and assist
in the retrieval process. In sequential access, a shared read-write mechanism is used, and this
must be moved from its current location to the desired location, passing and rejecting each
intermediate record. Thus, the time to access an arbitrary record is highly variable.
Direct access: As with sequential access, direct access involves a shared read-write mechanism.
However, individual blocks or records have a unique address based on physical location. Access
is accomplished by direct access to reach a general vicinity plus sequential searching, counting,
or waiting to reach the final location.
Random access: Each addressable location in memory has a unique, physically wired-in
addressing mechanism. The time to access a given location is independent of the sequence of
prior accesses and is constant. Thus, any location can be selected at random and directly
addressed and accessed. Main memory and some cache systems are random access.
Associative: This is a random access type of memory that enables one to make a comparison of
desired bit locations within a word for a specified match, and to do this for all words
simultaneously. Thus, a word is retrieved based on a portion of its contents rather than its
address. As with ordinary random-access memory, each location has its own addressing
mechanism, and retrieval time is constant independent of location or prior access patterns. Cache
memories may employ associative access. From a user's point of view, the two most
important characteristics of memory are capacity and performance.
Access time (latency): For random-access memory, this is the time it takes to perform a read or
write operation, that is, the time from the instant that an address is presented to the memory to
the instant that data have been stored or made available for use. For non-random-access memory,
access time is the time it takes to position the read-write mechanism at the desired location.
Memory cycle time: This concept is primarily applied to random-access memory and consists of
the access time plus any additional time required before a second access can commence. This
additional time may be required for transients to die out on signal lines or to regenerate data if
they are read destructively. Note that memory cycle time is concerned with the system bus, not
the processor.
Transfer rate: this is the rate at which data can be transferred into or out of a memory unit. For
random-access memory, it is equal to 1/(cycle time). For non-random-access memory, the
following relationship holds:

    T_N = T_A + N / R

where T_N = average time to read or write N bits, T_A = average access time, N = number of
bits, and R = transfer rate, in bits per second (bps).
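For instance (assumed figures: a 10 ms average access time and a 50 Mbit/s transfer rate), reading one 4 KB block:

```python
def read_time(t_access: float, n_bits: int, rate_bps: float) -> float:
    """T_N = T_A + N / R, in seconds."""
    return t_access + n_bits / rate_bps

print(read_time(0.010, 8 * 4096, 50e6))  # ~0.01066 s for a 4 KB block
```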
A variety of physical types of memory have been employed. The most common ones today are
semiconductor memory, magnetic-surface memory (used for disk and tape), and optical and
magneto-optical memory. Several physical characteristics of data storage are important. In a
volatile memory, information decays naturally or is lost when electrical power is switched off. In
a nonvolatile memory, information once recorded remains without deterioration until deliberately
changed; no electrical power is needed to retain it. Magnetic-surface memories are nonvolatile.
Semiconductor memory may be volatile or nonvolatile. Nonerasable memory cannot be altered,
except by destroying the storage unit. Semiconductor memory of this type is known as read-only
memory (ROM). Of necessity, a practical nonerasable memory must also be nonvolatile. For
random-access memory, the organization is a key design issue.
The design constraints on a computer's memory can be summed up by three questions: how
much? how fast? how expensive? The question of how much is somewhat open ended: if the
capacity is there, applications will likely be developed to use it. The question of how fast is, in a
sense, easier to answer: to achieve the greatest performance, the memory must be able to keep up
with the processor. That is, as the processor is executing instructions, we do not want it to have
to pause waiting for instructions or operands. The final question, cost, must also be considered:
for a practical system, the cost of memory must be reasonable in relation to the other
components. As might be expected, there is a trade-off among the three key characteristics of
memory, namely capacity, access time, and cost. A variety of technologies are used to implement
memory systems, and across this spectrum of technologies the following relationships hold:
faster access time, greater cost per bit; greater capacity, smaller cost per bit; greater capacity,
slower access time. The dilemma facing the designer is clear. The designer would like to use
memory technologies that provide for large-capacity memory, both because the capacity is
needed and because the cost per bit is low. However, to meet performance requirements, the
designer needs to use expensive, relatively lower-capacity memories with short access times. The
way out of this dilemma is not to rely on a single memory component or technology, but to
employ a memory hierarchy. A typical hierarchy is illustrated in Figure I below. As one goes
down the hierarchy, the following occur: (a) decreasing cost per bit; (b) increasing capacity;
(c) increasing access time; and (d) decreasing frequency of access of the memory by the processor.
Thus, smaller, more expensive, faster memories are supplemented by larger, cheaper,
slower memories. The key to the success of this organization is item (d): decreasing
frequency of access.
FIGURE I: The memory hierarchy
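A two-level sketch with assumed numbers shows why item (d) makes the hierarchy work: if most accesses are satisfied by the fast level, the average access time stays close to the fast level's time.

```python
def avg_access(t_fast: float, t_slow: float, hit_ratio: float) -> float:
    """Average access time; on a miss the fast level is searched first."""
    return hit_ratio * t_fast + (1.0 - hit_ratio) * (t_fast + t_slow)

# Assumed: 0.01 us fast level, 0.1 us slow level, 95% of accesses hit.
print(avg_access(0.01, 0.1, 0.95))  # 0.015 us, close to the fast level's 0.01
```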
The fastest, smallest, and most expensive type of memory consists of the registers internal to the
processor. Typically, a processor will contain a few dozen such registers, although some
machines contain hundreds of registers. Skipping down two levels, main memory is the principal
internal memory system of the computer. Each location in main memory has a unique address.
Main memory is usually extended with a higher-speed, smaller cache. The cache is not usually
visible to the programmer or, indeed, to the processor. It is a device for staging the movement of
data between main memory and processor registers to improve performance. The three forms of
memory just described are, typically, volatile and employ semiconductor technology. The use of
three levels exploits the fact that semiconductor memory comes in a variety of types, which
differ in speed and cost. Data are stored more permanently on external mass storage devices, of
which the most common are hard disk and removable media, such as removable magnetic disk,
tape, and optical storage. External, nonvolatile memory is also referred to as secondary or
auxiliary memory. It is used to store program and data files and is usually visible to the
programmer only in terms of files and records, as opposed to individual bytes or words. Other
forms of secondary memory include optical and magneto-optical disks. Finally, additional levels
can be effectively added to the hierarchy in software. A portion of main memory can be used as
a buffer to hold data temporarily that is to be read out to disk. Such a technique, sometimes
referred to as a disk cache, improves performance.
ERROR CORRECTION: Error-correction techniques are commonly used in memory systems. A
semiconductor memory system is subject to errors, which can be categorized as hard failures and
soft errors. A hard failure is a permanent physical defect such that the memory cell or cells affected
cannot reliably store data but become stuck at 0 or 1 or switch erratically between 0 and 1. Hard
errors can be caused by harsh environmental abuse, manufacturing defects, and wear. A soft
error is a random, nondestructive event that alters the contents of one or more memory cells
without damaging the memory. Soft errors can be caused by power supply problems or alpha
particles; these particles result from radioactive decay and are distressingly common because
radioactive nuclei are found in small quantities in nearly all materials. Both hard and soft errors
are clearly undesirable, and most modern main memory systems include logic for both detecting
and correcting errors. The error-correction technique involves adding redundant bits, which are a
function of the data bits, to form an error-correcting code. When an error is detected and it is
possible to correct it, the data bits plus error-correction bits are fed into a corrector, which
produces a corrected set of bits to be sent out. When an error is detected but it is not possible to
correct it, this condition is reported. Codes that operate in this fashion are referred to as
error-correcting codes; a code is characterized by the number of bit errors in a word that it can
detect and correct.
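As a concrete, hedged example of such a code (the classic Hamming(7,4) scheme, chosen here purely for illustration; it is not claimed to be the code any particular memory system uses): three redundant bits are added to four data bits, and on read a syndrome locates and flips any single erroneous bit.

```python
def encode(d):
    """Hamming(7,4): data bits d1..d4 -> codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4      # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4      # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4      # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def correct(c):
    """Compute the syndrome, fix a single-bit error, return the data bits."""
    syndrome = 0
    for position, bit in enumerate(c, start=1):
        if bit:
            syndrome ^= position      # XOR of the positions holding a 1
    if syndrome:                      # nonzero syndrome = error position
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]   # recover d1, d2, d3, d4

word = encode([1, 0, 1, 1])
word[4] ^= 1                          # a soft error flips one stored bit
print(correct(word))                  # [1, 0, 1, 1] -- corrected
```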
LOW-LEVEL PARALLELISM AND ITS IMPLEMENTATION IN A PROCESSOR.
At the micro-operation level, multiple control signals are generated almost at the same time.
Instruction pipelining, at least to the extent of overlapping fetch and execute operations, has been
around for a long time. Both of these are examples of performing functions in parallel. This
approach is taken further with instruction-level parallelism: in a superscalar machine, there are
multiple instructions from the same program executing in parallel. As computer technology has
evolved, and as the cost of computer hardware has dropped, computer designers have sought
more and more opportunities for parallelism, usually to enhance performance and, in some cases,
to increase availability.
Single instruction, single data (SISD) stream: a single processor executes a single instruction
stream to operate on data stored in a single memory. Uniprocessors fall into this category.
Single instruction, multiple data (SIMD) stream: a single machine instruction controls the
simultaneous execution of a number of processing elements on a lockstep basis. Each processing
element has an associated data memory, so that each instruction is executed on a different set of
data by the different processors. Vector and array processors fall into this category.
Multiple instruction, single data (MISD) stream: a sequence of data is transmitted to a set of
processors, each of which executes a different instruction sequence. This structure is not
commercially implemented.
Multiple instruction, multiple data (MIMD) stream: a set of processors simultaneously execute
different instruction sequences on different data sets. Multiprocessors fall into this category.