Computer Architecture Study Guide (Draft - Pending Final Review)
COMPUTER ARCHITECTURE (BSc_CA_621)
Registered with the Department of Higher Education as a Private Higher Education Institution under the Higher Education Act,
1997.
Registration Certificate No. 2000/HE07/008
LEARNER GUIDE
MODULES: Computer Architecture [1st Semester]
PREPARED ON BEHALF OF PC TRAINING AND BUSINESS COLLEGE
TABLE OF CONTENTS

SECTION A: PREFACE
1. Welcome
2. Title of Modules
3. Purpose of Module
4. Learning Outcomes
5. Method of Study
7. Notices & Key Concepts in Assignments and Examinations
10. Specimen Assignment Cover Sheet

SECTION B: LEARNER GUIDE (Topics 1-6)
SECTION A: PREFACE
1. WELCOME
Welcome to the Department of Media, Information and Communication
Technology at PC Training and Business College. We trust you will find the
contents and learning outcomes of this module both interesting and insightful as
you begin your academic journey and eventually your career in the Information
Technology realm.
This section of the study guide is intended to orientate you to the module before
the commencement of formal lectures.
The following lectures will focus on the common study units described:
SECTION A: WELCOME & ORIENTATION (Lectures 1-4)

Study unit 1: Orientation Programme
Introduction of academic staff to the learners by the Academic Programme Manager; introduction of Institution policies.

Study unit 2: Orientation of Learners to Library and Student Facilities
2. TITLE OF MODULE

Qualification: BSc Computer Science
Title of Module: Computer Architecture
Code: CA
NQF Level: 6
Credits: 10
Mode of Delivery: Contact
3. PURPOSE OF MODULE
3.1 Computer Architecture
This course provides an overview of the organization and architecture of
general-purpose computers. Learners will be exposed to several new aspects of
programming, including hardware, embedded and register programming. A
variety of illustrative examples is drawn from across various computer
architecture platforms in order to entrench the required body of knowledge that
the syllabus seeks to establish at this level. The syllabus has been designed to
ensure that each chapter leads naturally into the next.
4. LEARNING OUTCOMES
On completion of this module, learners should be able to:
Define and explain Computer Architecture and Organization concepts,
including functional components and their characteristics, performance,
and the detailed interactions in computer systems, including the system
bus, different types of memory, input/output, and the CPU.
Employ Computer Architecture theory to solve basic functional
hardware programming and organizational problems.
Assemble basic computer components.
Select and use the appropriate hardware and software tools for systems
integration.
5. METHOD OF STUDY
Only the key sections that have to be studied are indicated under each topic in
this study guide, and learners are expected to have a thorough working
knowledge of the specified/referenced sections of the prescribed textbook.
These form the basis for tests, assignments and examinations. To be able to
complete the activities and assignments for this module, and to achieve its
learning outcomes, you must therefore work through the prescribed material as indicated.
Prescribed Material:
NB: Learners please note that a limited number of copies of the recommended
texts and reference material will be made available at your campus library.
Learners are advised to make copies of, or take notes from, the relevant
information, as the content matter is examinable.
8.3 Independent Research: Library Infrastructure
Knowledge
Skills demonstrated: observation and recall of information; knowledge of dates, events and places; knowledge of major ideas; mastery of subject matter.
Question cues: list, define, tell, describe, identify, show, label, collect, examine, tabulate, quote, name, who, when, where, etc.

Comprehension
Skills demonstrated: understanding information; grasping meaning; translating knowledge into a new context; interpreting facts; comparing and contrasting; ordering, grouping and inferring causes; predicting consequences.
Question cues: summarize, describe, interpret, contrast, predict, associate, distinguish, estimate, differentiate, discuss, extend.

Application
Skills demonstrated: using information; using methods, concepts and theories in new situations; solving problems using required skills or knowledge.
Question cues: apply, demonstrate, calculate, complete, illustrate, show, solve, examine, modify, relate, change, classify, experiment, discover.

Analysis
Skills demonstrated: seeing patterns; organization of parts; recognition of hidden meanings; identification of components.
Question cues: analyze, separate, order, explain, connect, classify, arrange, divide, compare, select, infer.

Synthesis
Skills demonstrated: using old ideas to create new ones; generalizing from given facts; relating knowledge from several areas; predicting and drawing conclusions.
Question cues: combine, integrate, modify, rearrange, substitute, plan, create, design, invent, what if?, compose, formulate, prepare, generalize, rewrite.
Evaluation
The examination department will make available to you the details of the
examination (date, time and venue) in due course. You must be seated in the
examination room 15 minutes before the commencement of the examination. If
you arrive late, you will not be allowed any extra time. Your learner registration
card must be in your possession at all times.
9.4. Final Assessment
The final assessment for this module will be weighted as follows:
Continuous Assessment Test 1
Continuous Assessment Test 2
Assignment 1
Total Continuous Assessment: 40%
Semester Examinations: 60%
Total: 100%
9.5. Key Concepts in Assignments and Examinations
In assignment and examination questions you will notice certain key concepts
(i.e. words/verbs) which tell you what is expected of you. For example, you
may be asked in a question to list, describe, illustrate, demonstrate, compare,
construct, relate, criticize, recommend or design particular information/aspects/factors/situations. To help you know exactly what these key concepts or
verbs mean, so that you will know exactly what is expected of you, we present
the taxonomy by Bloom given earlier, explaining the concepts and stating the
level of cognitive thinking to which each refers.
WORK READINESS PROGRAMME

SOFT SKILLS
Time Management
Working in Teams
Problem Solving Skills
Attitude & Goal Setting
Etiquette & Ethics
Communication Skills

EMPLOYMENT SKILLS
CV Writing
Interview Skills
Presentation Skills
Employer / Employee Relationship

END USER COMPUTING
Email & E-Commerce
Spreadsheets
Database
Presentation
Office Word
SECTION B
LEARNER GUIDE
MODULE: Computer Architecture (2ND SEMESTER)
TOPIC 1: Computer Organization

TOPIC 2: Computer Evolution
2.4 Later Generations; 2.5 Semiconductor Memory; 2.6 Microprocessors; 2.8 Microprocessor Speed; 2.11 ARM Evolution

TOPIC 3: Addressing, the Control Unit and Performance Metrics
3.1 Addressing Modes; 3.2 Addressing Formats

TOPIC 4: Number Systems and Digital Logic
4.1 Why Binary; 4.4 Binary Addition; 4.5 Binary Subtraction; 4.6 Binary Multiplication; 4.7 Binary Division; 4.8 Digital Logic; 4.9 Boolean Algebra

TOPIC 5: Memory and Input/Output
5.1 Basic Model; 5.2 Cache Architecture; 5.3 Write Policy; 5.4 Cache Components; 5.5 Cache Organization; 5.11 Operating Modes; 5.12 Cache Consistency; 5.13 Internal Memory; 5.15 Types of ROM; 5.16 External Memory; 5.17 Disk Characteristics; 5.18 RAID Technology; 5.19 Optical Disks; 5.20 Magnetic Tape; 5.22 Input/Output; 5.23 Buses; 5.24 Programmed I/O; 5.25 Interrupt-Driven I/O

TOPIC 6: Pipelining, CISC and RISC Processors
6.2 Pipeline Limitations; 6.6 A Look at CISC; 6.8 RISC Pipelining; 6.10 Super-pipelining
Figure 1.1: The functions of the computer. Within its operating environment (the source and destination of data), the computer provides a data movement apparatus, a data storage facility, a data processing facility and a control mechanism.
Both the structure and function of a computer system are very simple, and Figure 1.1 shows in a
very simplified way that the computer performs four basic functions:
Data processing
Data storage
Data Movement, and
Control.
Essentially, a computer must process data. Data may be processed for many reasons and
take a wide variety of forms, from employee details to student details, which must be
processed to produce correspondingly ordered records. These records, which at this stage
reflect useful, consumable information, are then stored. The computer system implements
storage in either of two ways. While the computer is still processing data, intermediate
results that will be needed as input in subsequent processing stages are stored in a
short-term data storage facility called Random Access Memory [RAM]. This type of memory is
temporary, and whatever is in RAM at the point the system shuts down is irretrievably lost;
for this reason, RAM is said to be volatile. The other very important aspect of storage
arises when the results of data processing [information] are stored permanently on the
computer or on external devices such as flash drives, DVDs or any of the many available
online storage services. Many organizations use online storage services as an outsourced
backup alternative.
The computer must have the capacity to move data between itself and the external world. The
computer's operational environment consists of devices which serve either as sources or
destinations of data. When data is received from or delivered to a device that is directly
connected to the computer, the process is known as input/output [I/O], and the device so
involved is known as a peripheral. When data are moved over long distances to and from a
remote device, the process is known as Data Communications.
Ultimately, there must be Control of these three functions of a computer system. Within the
computer, this control is provided by a component called the Control Unit, which manages the
computer's resources and orchestrates the performance of the functional parts of the computer
system in response to user instructions, which in many cases take the form of a program.
Now moving onto the Structure of a computer system, Figure 1.2 reveals a very simplified portrayal
of the computer system. All computers will invariably interact with the external environment and
they generally achieve this through peripherals and linkages which we shall term communication
lines.
Figure 1.2: The computer and its environment. Peripherals and communication lines link the external world to the computer, which provides storage and processing.
Though we have this depiction of the Computer System in the above figure, of greater importance to
us is the internal structure itself, of which the CPU is shown in Figure 1.3 below. There are four main
structural components in the computer, viz:
The Central Processing Unit [CPU]: This extremely important component is interchangeably
referred to as the processor and is the heart and the center of all manner of control on all
operations in the performance of data processing functions by the system.
Main Memory: In essence the RAM; it stores [intermediate] data during processing,
as earlier alluded to.
I/O: These functionaries move data between the computer and its external environment.
System Interconnection: This refers to the mechanism which provides communication
among the CPU, Main Memory and I/O. A common example of system interconnection is by
means of a System Bus consisting of a number of conducting wires to which all other
components attach.
Figure 1.3: The internal structure of the computer. Main Memory and I/O connect to the CPU, which comprises the Arithmetic Logic Unit, registers, and a Control Unit built from sequencing logic, control memory and control registers.
Considering the four components of the computer system, the most complex and the most
interesting is the CPU whose major components are:
The Control Unit [CU]: Which controls the operation of the CPU and hence the computer as
a whole
The Arithmetic Logic Unit [ALU]: It performs the computer's data processing functions
Registers: These provide intermediate storage which is internal to the CPU
CPU Interconnection: It exists in the form of a CPU internal bus (represented by block
arrows in Figure 1.3) and provides communication between the Control Unit, the ALU and
the registers.
There it is! We have come to the end of this first part of our module. Proceed to answer all of the
following painless questions and after satisfying yourself that you have adequately answered the
questions, you may turn to the next section of the guide.
Think Point: Before Attempting the Self-Assessment Exercise below, take time to go
through the prescribed book and read pages 26-32:
Stallings, W. (2013). Computer Organization and Architecture (8th ed.). Pearson Education, NJ.
Pay particular attention to the distinction that the text makes between Computer Architecture and
Computer Organization.
Try other similar texts and compare how the following terms are defined and explained:
Computer Architecture
Computer Organization
CPU Structure
CPU Function
Self-Assessment Exercise
1.0 What in general terms is the distinction between Computer Organization and Computer Architecture?
1.1 Explain the relationship between Computer Organization and Computer Architecture across different models from the same manufacturer or vendor.
1.2 What in general terms is the distinction between computer structure and computer function?
1.3 What are the four main functions of a computer?
1.4 List and in each case briefly define the main structural components of a computer.
1.5 List and briefly define the main structural components of a CPU.
At this point, the swimming toward the deep end begins. We begin by looking very briefly at
computer evolution and this is a subject which you have no doubt encountered already in the very
formative stages of your Information Technology studies. We look at the same time at how
computers performed throughout the evolution up until now, and note the similarities and
differences.
From the very first known computer up until now, the evolution of computers has been
characterized by increasing processor speed, reduction in component size, increasing memory
size, and increasing I/O capacity and speed. The great increase in processor speed has been
made possible by the reduction in the size of processor components; this reduction shortens
the distances between components and so raises processing speed. Modern processor
organization has also contributed to the current gains in processor speed. The most common
organization technique is speculative execution, in which the processor anticipates
instructions that might be needed in the near future and processes them beforehand.
We now look at this history through the technological events and milestones which separated the
different generations that marked the evolution of computers up to the current state-of-the-art.
2.1 The First Generation: Vacuum Tubes
The ENIAC [Electronic Numerical Integrator And Computer] was the first general purpose digital
electronic computer. It was designed and constructed at the University of Pennsylvania by John
Mauchly and John Eckert as a direct response to the requirements of the US army in WW2. The
project began in 1943, but by the time it was completed in 1946, the war target had already been
missed. The machine was huge: it weighed more than 30 tonnes, contained 18,000 vacuum tubes
and covered about half a football pitch. It consumed huge amounts of power, but compensated
by offering processing speeds faster than any electromechanical computer of the time.
The ENIAC was a decimal [and not a binary] machine. Memory was made up of 20 accumulators
each of which was capable of holding a 10-digit number. A ring of 10 vacuum tubes represented
each digit and at any time only one of the 10 vacuum tubes was in an ON state representing one of
the ten digits. A major setback in ENIAC architecture was that it was manually programmed by
setting switches and plugging and unplugging cables.
2.1.1 The von Neumann Machine was designed by John von Neumann, a Mathematician who was
also on the ENIAC project as a consultant. The von Neumann machine was necessitated by the
tedium associated with manually entering and altering programs on the ENIAC. The programming
process could be automated if the program could be represented in a form suitable for storage
in memory alongside the data. The computer could then get its instructions by reading them
from memory, and a program could be set or altered by setting the values of a portion of
memory. This is what we now know as the stored program concept; it is attributable to John
von Neumann and also, to a certain extent, to the team that built the ENIAC.
The first publication of the stored program concept was in a 1945 proposal by von Neumann for a
new computer, the EDVAC [Electronic Discrete Variable Computer]. In 1946, von Neumann and his
colleagues began the design of the first stored-program computer, referred to as the IAS
computer. It was completed in 1952 and became the prototype of all subsequent general-purpose
computers. The general structure of the IAS is depicted in Figure 1.3 in Chapter 1, which
highlights the general structure of the CPU.
The IAS structure consists of:
A Main Memory, which stores both data and instructions
An Arithmetic Logic Unit [ALU] capable of operating on binary data
A Control Unit [CU] which interprets the instructions in memory and causes them to be
executed
I/O equipment operated by the CU.
Von Neumann's proposal of 1945, which resulted in the building of the von Neumann machine,
contained a broad specification which built a case for all the components which constitute
the structure of the computer, and emphasis was laid on the internal structure of the CPU.
With rare exceptions, all computers as we know them today are based on von Neumann's
proposal, and they are therefore all referred to as von Neumann machines.
The memory of the IAS contains 1000 storage locations called words of 40 binary digits each. Both
data and instructions are stored there. Numbers are represented in the IAS in binary form and
instructions are in the form of a binary code. Each number is represented by a sign bit and a 39-bit
value. A word may also contain two 20-bit instructions, with each instruction consisting of an 8-bit
opcode specifying the operation to be performed and a 12-bit address designating one of the words
in memory. The CU operates the IAS by fetching the instructions from memory and executing them
one at a time.
Figure 2.1: IAS memory formats. (a) Number word: a sign bit followed by a 39-bit value. (b) Instruction word: two 20-bit instructions; the left instruction occupies bits 0-19 (opcode in bits 0-7, address in bits 8-19) and the right instruction occupies bits 20-39 (opcode in bits 20-27, address in bits 28-39).
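To make this layout concrete, the following short Python sketch (illustrative only; not from the prescribed text, and the sample word value is made up) unpacks a 40-bit IAS word into its two opcode/address pairs:

```python
# Illustrative sketch, not from the prescribed text: unpacking a 40-bit
# IAS word into its left and right instructions, each an 8-bit opcode
# followed by a 12-bit address, as in Figure 2.1. The word is made up.

def unpack_ias_word(word: int):
    assert 0 <= word < 2 ** 40, "IAS words are 40 bits wide"
    left = (word >> 20) & 0xFFFFF          # bits 0-19: left instruction
    right = word & 0xFFFFF                 # bits 20-39: right instruction
    def split(instruction):
        opcode = (instruction >> 12) & 0xFF    # top 8 bits
        address = instruction & 0xFFF          # low 12 bits
        return opcode, address
    return split(left), split(right)

word = (0x05 << 32) | (0x123 << 20) | (0x0A << 12) | 0x456
print(unpack_ias_word(word))   # ((5, 291), (10, 1110))
```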
Figure 2.2: The structure of the IAS computer. The CPU contains the registers AC, MQ, MBR, IBR, PC, IR and MAR together with the control circuits, which exchange control signals with Main Memory and the I/O equipment.
Reading/Activity: Read the section von Neumann Machine in your prescribed textbook on
page 36 and study Figure 2.2 above. It is normal to encounter a question that will require you to
reproduce it as well as answer questions which derive their answers from the representation of
Figure 2.2.
Activity: Indicate the width in bits of each data path (e.g. between the AC and ALU)
Transistors were then installed in computers and other electronic equipment, but the
manufacturing process from transistor to circuit board was extremely expensive and cumbersome.
Problems in the manufacture of computer equipment became visible when it was no longer
possible to introduce newer and more powerful machines without increasing the number of
transistors in the machine. As these powerful machines were churned out, they contained tens
of thousands of transistors; this figure sky-rocketed with newer innovations, and making
these machines grew increasingly difficult.
In 1958, however, came the achievement that revolutionized electronics and started the era of
microelectronics: the invention of the integrated circuit. It is the integrated circuit that defines
the Third Generation of Computers. The integrated circuit exploits the fact that such components as
transistors, resistors and conductors can be fabricated from a semi-conductor such as silicon. It is
merely an extension of the solid-state art to fabricate an entire circuit in a tiny piece of silicon rather
than assemble discrete components made from separate pieces of silicon into the same circuit.
Many transistors can be produced at the same time on a single wafer of silicon. These transistors can
be connected with a process of metallization to form circuits. A single wafer of silicon is divided into
chips, and each chip contains gates [which implement logic rules such as "If A
and B are TRUE, then C is also TRUE"], memory cells and a number of input and output attachment
points. The chip is packaged in a housing that protects it, and this housing provides pins for
attachment to devices beyond the chip itself.
Initially, only a few gates and memory cells could be reliably manufactured and packaged together.
These early integrated circuits are referred to as small-scale integration [SSI]. As time went on, it
became possible to pack more and more components into the same chip.
2.4 Later Generations
Beyond the third generation, there is less agreement on how to define the generations of
computers. Table 2.1 seems to suggest that there have been a number of later generations based on
advances in integrated circuit technology. With the introduction of large-scale integration [LSI],
more than 1,000 components can be placed on a single integrated circuit chip. Very Large-scale
integration [VLSI] achieved more than 10,000 components per chip while current ultra-large-scale
integration [ULSI] chips can contain more than one million components. With the rapid pace of
technology, the high rate of introduction of new products, and the importance of software and
communications as well as hardware, the classification becomes less clear and less meaningful.
Table 2.1 Computer Generations

Generation | Approximate Dates | Technology | Typical Speed [operations/second]
1 | 1946-1957 | Vacuum Tube | 40,000
2 | 1958-1964 | Transistor | 200,000
3 | 1965-1971 | SSI and Medium SI | 1,000,000
4 | 1972-1977 | LSI | 10,000,000
5 | 1978-1991 | VLSI | 100,000,000
6 | 1991- | ULSI | 1,000,000,000
Second-generation main memory was built from tiny rings of ferromagnetic material strung on
wire screens inside the computer. Magnetized one way, a ring, called a core, represented a 1;
magnetized the other way, it stood for a 0. Magnetic-core memory was rather fast: it took as
little as a millionth of a second to read a bit stored in memory. But it was extremely
expensive and bulky, and it employed a technique called destructive readout, meaning that
once read, the data stored in a core would immediately be erased.
Semiconductor memory made its first appearance in 1970. The first chip, about the size of a
single magnetic core, could hold 256 bits of memory. It was non-destructive and much faster
than core: it took only 70 billionths of a second to read a bit from memory. However, the
cost per bit was higher than that of core. Developments continued in the semiconductor arena,
and in 1974 the price per bit of semiconductor memory finally dropped below that of core.
Since then there has been a continuing drastic decline in the cost of semiconductor memory,
coupled with increasing physical memory density. This has led to machines that are much
smaller and faster, yet with the memory sizes of the larger and more expensive behemoths of
previous years. Since 1970, semiconductor memory has been through 13 generations: 1K, 4K,
16K, 64K, 256K, 1M, 4M, 16M, 64M, 256M, 1G, 4G and, as of this writing, 16 Gbits can be
packed on a single semiconductor memory chip. Each subsequent generation has achieved four
times the storage density of the previous one, accompanied by declining cost per bit and
declining access time.
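As a quick check of this quadrupling, consider the following illustrative sketch (not from the prescribed text):

```python
# Illustrative sketch, not from the prescribed text: starting at 1 Kbit
# and quadrupling 12 times gives the 13th generation, 16 Gbit per chip.

bits = 2 ** 10                  # generation 1: 1 Kbit (1970)
for _ in range(12):             # 12 quadruplings span generations 2-13
    bits *= 4
print(bits == 16 * 2 ** 30)     # True: the 13th generation is 16 Gbit
```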
2.6 Microprocessors
Just as the density of elements on memory chips has continued to rise, so has the density on
processor chips. As time went on, more and more elements were packed on each chip, so that
fewer and fewer chips were needed to construct a single computer processor. A breakthrough
was achieved in 1971 when Intel developed its 4004. The 4004 was the first chip to contain
all the elements of a CPU on a single chip: the microprocessor was born.
The 4004 could add two 4-bit numbers and could multiply only by repeated addition. By today's
standards the 4004 is hopelessly primitive, but it was the necessary starting point for a
continuing evolution of microprocessor capability and awesome power.
This evolution can be seen most easily in the number of bits that the processor deals with at a time.
There is no clear measure of this, but perhaps the best measure is the data bus width: the number of
bits of data that can be brought into or sent out of the processor at any one time. Another measure
is the number of bits in the accumulator or in the set of general purpose registers. Often these
measures coincide, but not always. For example, a number of microprocessors were developed that
can operate on 16-bit numbers but can only read and write 8-bits at a time.
The next step in the development of the microprocessor was the introduction in 1972 of the Intel
8008. This was the first 8-bit microprocessor and was twice as complex as the 4004. Neither of these
steps was to produce the impact of the next major event: the introduction in 1974 of the Intel 8080
the first general-purpose microprocessor. Whereas the 4004 and the 8008 were designed for
specific applications, the 8080 was designed to be the CPU of a general-purpose computer. The 8080
is an 8-bit microprocessor but much faster than the 8008 having a much richer instruction set and a
larger addressing capability. At the same time that all this was happening, 16-bit
microprocessors were being developed, and at the end of the 1970s they began to appear,
Intel's entry being the 8086. The next step in these developments occurred when Bell
Laboratories and Hewlett-Packard developed the first single-chip 32-bit microprocessors.
Intel introduced its own 32-bit microprocessor, the 80386, in 1985.
Evolution of Intel microprocessors:

Processor | Year | Clock Speeds | Bus Width | Transistors | Feature Size | Addressable Memory | Virtual Memory | Cache
8080 | 1974 | 2 MHz | 8 bits | 6,000 | 6 µm | 64 KB | -- | --
8086 | 1978 | Up to 10 MHz | 16 bits | 29,000 | 3 µm | 1 MB | -- | --
8088 | 1979 | 5-8 MHz | 8 bits | 29,000 | 6 µm | 1 MB | -- | --
80286 | 1982 | 6-12.5 MHz | 16 bits | 134,000 | 1.5 µm | 16 MB | 1 GB | --
386 DX | 1985 | 16-33 MHz | 32 bits | 275,000 | 1 µm | 4 GB | 64 TB | --
386 SX | 1988 | 16-33 MHz | 16 bits | 275,000 | 1 µm | 16 MB | 64 TB | --
486 DX | 1989 | 25-50 MHz | 32 bits | 1.2 million | 0.8-1 µm | 4 GB | 64 TB | 8 KB
486 SX | 1991 | 16-33 MHz | 32 bits | 1.185 million | 1 µm | 4 GB | 64 TB | 8 KB
Pentium | 1993 | 60-166 MHz | 32 bits | 3.1 million | 0.8 µm | 4 GB | 64 TB | 8 KB
Pentium Pro | 1995 | 150-200 MHz | 64 bits | 5.5 million | 0.6 µm | 64 GB | 64 TB | 512 KB L1, 1 MB L2
Pentium II | 1997 | 200-300 MHz | 64 bits | 7.5 million | 0.35 µm | 64 GB | 64 TB | 512 KB L2
Pentium 4 | 2000 | 1.3-1.8 GHz | 64 bits | 42 million | 180 nm | 64 GB | 64 TB | 256 KB L2
Core 2 Duo | 2006 | 1.06-1.2 GHz | 64 bits | 167 million | 65 nm | 64 GB | 64 TB | 2 MB L2
Core 2 Quad | 2008 | 3 GHz | 64 bits | 820 million | 45 nm | 64 GB | 64 TB | 6 MB L2
Branch Prediction: The processor looks ahead in the instruction code fetched from memory
and predicts which branches, or groups of instructions, are likely to be processed next. If
the processor guesses right most of the time, it can pre-fetch the correct instructions and
buffer them so that it is kept busy and does not have to wait while instructions are fetched
on demand: they are fetched before they are needed. Multiple branches can be predicted.
Branch prediction has the effect of increasing the amount of work available for the
processor to execute.
Data Flow Analysis: The processor analyzes which instructions are dependent on each
other's results, or data, to create an optimized schedule of instructions; this prevents
unnecessary processing delays.
Speculative Execution: Using branch prediction and data flow analysis, some processors
speculatively execute instructions before their actual appearance in the program execution,
holding the results in temporary locations. This helps keep the processing engine as busy as
possible by executing instructions that are likely to be needed well in advance of their
actually being required.
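The guide does not prescribe a particular prediction algorithm, but a common textbook scheme is the 2-bit saturating counter. The sketch below (illustrative only; the branch history is made up) shows how such a predictor "guesses right most of the time" on a loop branch:

```python
# Illustrative sketch, not from the prescribed text: a 2-bit
# saturating-counter branch predictor. States 0-1 predict NOT TAKEN,
# states 2-3 predict TAKEN; each actual outcome nudges the counter
# toward the behaviour just observed.

def prediction_accuracy(outcomes):
    counter = 2                       # start in "weakly taken" (assumption)
    correct = 0
    for taken in outcomes:
        prediction = counter >= 2     # True means "predict taken"
        correct += (prediction == taken)
        # move the counter toward the actual outcome, saturating at 0 and 3
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return correct / len(outcomes)

# A loop branch: taken nine times, then not taken on loop exit.
print(prediction_accuracy([True] * 9 + [False]))   # 0.9
```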
These techniques make it possible to exploit effectively and efficiently the sheer power and
raw speed of the processor. But as processor speed skyrockets, another acute problem arises:
a speed differential is created between the processor and main memory. The interface between
main memory and the processor is the most crucial pathway in the computer, because it is
responsible for carrying a constant flow of program instructions and data between memory
chips and the processor. If memory or the pathway fails to keep pace with the processor's
insistent demands, the processor stalls in a wait state; valuable processing time is lost
and the processor's raw power is severely undermined.
There are a number of ways in which a systems architect can attack this problem, all of which are
reflected in contemporary computer designs. Consider the following:
Increase the number of bits that are retrieved at any one time by making the DRAMs wider
rather than deeper and by using wide data paths.
Change the DRAM interface to make it more efficient by including a cache* or other
buffering scheme on the DRAM chip.
Reduce the frequency of memory access by incorporating increasingly complex and efficient
cache structures between the processor and main memory. This includes the incorporation
of one or more caches on the processor chip as well as an off-chip cache close to the
processor chip.
Increase the interconnect bandwidth between processors and memory by using higher
speed buses and a hierarchy of buses to buffer and structure data flow.
8080: The world's first general-purpose microprocessor. This was an 8-bit machine with an
8-bit data path to memory. The 8080 was used in the first personal computer, the Altair.
8086: A far more powerful 16-bit machine. In addition to a wider data path and larger
registers, the 8086 sported an instruction cache, or queue, that pre-fetched a few instructions
before they were executed. A variant of this processor, the 8088, was used in IBM's first
personal computer, securing Intel's success as a microprocessor manufacturer. The 8086 is
the first appearance of the x86 architecture.
80286: This extension of the 8086 enabled the addressing of 16Mbyte memory instead of
just 1Mbyte.
* A cache is a relatively small but fast memory interposed between a larger but slower memory and the system logic that
accesses the larger memory. The cache holds recently accessed data and is designed to speed up subsequent access to
that data.
80386: This was Intel's first 32-bit machine and represented a major overhaul in the Intel
product line. With a 32-bit architecture, the 80386 rivalled the complexity and power of
minicomputers and mainframes introduced just a few years earlier. This was the first Intel
processor to support multi-tasking, meaning that it could run multiple programs all at the
same time.
80486: This microprocessor introduced the use of much more sophisticated and powerful
cache technology as well as instruction pipelining. A pipeline works in much the same way as
an assembly line in a manufacturing plant enabling different stages of execution of different
instructions to occur at the same time along the pipeline. The 80486 also included a built-in
math co-processor, relieving the CPU of complex math operations.
Pentium: With the Pentium, Intel introduced the use of superscalar techniques, which allow
multiple instructions to execute in parallel. Superscalar techniques allow multiple pipelines
within a single processor so that instructions that do not depend on one another can be
executed in parallel.
Pentium Pro: This processor continued the move into superscalar organization begun with
the Pentium, with aggressive use of register renaming, branch prediction, data flow analysis
and speculative execution.
Pentium II: The Pentium II incorporated Intel MMX technology which is designed specifically
to process video, audio and graphics data efficiently.
Pentium III: This processor incorporates additional floating-point instructions to support 3D
graphics software.
Pentium 4: The Pentium 4 includes additional floating-point instructions and other
enhancements for multimedia.
Core: This is the first Intel x86 microprocessor with a dual core, referring to the
implementation of two microprocessors in a single chip.
Core 2: The Core 2 extends the architecture to 64-bits. The Core 2 Quad provides 4
processors on a single chip.
Over 30 years after its introduction in 1978, the x86 architecture continues to dominate the
processor architecture outside of embedded systems. Although the organization and technology of
the x86 has changed dramatically over the decades, the instruction set architecture has evolved to
remain backward compatible with earlier versions. All changes to the instruction set architecture
have involved additions to the instruction set with no subtractions. The rate of change has been one
instruction per month added to the instruction set over the last 30 years so that there are now over
500 instructions in the instruction set.
The origins of ARM can be traced back to the British company Acorn Computers. In the early
1980s Acorn was awarded a contract by the BBC to develop a new microcomputer architecture for
the BBC Computer Literacy Project. The success of this contract enabled Acorn to go on to
develop the first commercial RISC processor, the Acorn RISC Machine. The first version became
operational in 1985 and was used for internal research and development, as well as serving as
a coprocessor in the BBC machine. Also in 1985, Acorn released the ARM2, which offered
greater functionality and speed within the same physical space. Further improvements were
achieved with the release of the ARM3 in 1989. Throughout this period, Acorn used the company
VLSI Technology to do the actual fabrication of the processor chips.
Self-Assessment Exercise
2.0 List the six basic types of register and state:
2.0.1 Their Function
2.0.2 Where they are found within the CPU?
2.1 Explain clearly what is meant by the stored program concept
2.2 Identify the first two major breakthroughs that ushered in the technological revolution
characterizing the period between the first and third generations of computers.
2.3 Give a definition of what a solid state device is while citing an example and list the
advantages that solid state devices introduced into the Computer Architecture domain.
2.4 Explain what is meant by data channel and state the biggest advantage(s) that data
channels bring to Computer Architecture and design.
2.5 What is meant by ferromagnetic material and what is/was this material meant for?
2.6 How was ferromagnetic material used in the implementation of technology up until the
third generation of computers?
2.7 Define:
2.7.1 Pipelining
2.7.2 Superscalar processing
2.8 Clearly distinguish between CISC and RISC, giving a simple example which makes use of a
simplified high-level language instruction like ADD, SUBTRACT, MULTIPLY, etc. Your
illustration must show how either implementation treats the instruction at processor level.
Assume that all the variables are in memory.
Problem
2.0 The ENIAC was a decimal machine, where a register was represented by a ring of vacuum
tubes. At any time, only one vacuum tube was in an ON state, representing one of the 10
digits. Assuming that the ENIAC had the capability to have multiple vacuum tubes in the ON
and OFF state simultaneously, why is this representation wasteful and what range of
integer values could we represent using the 10 vacuum tubes?
Relate to, understand and explain the main instruction addressing modes and formats
Obtain a deeper grasp of computer performance and performance metrics
Answer questions relating to the Control Unit and its operations
Demonstrate a basic understanding of the micro-operations that the CU goes through in
order to influence the completion of one computer instruction
Demonstrate by ably and aptly answering questions relating to the performance and
performance metrics of the Control Unit
Think Point: Consider the advantages and disadvantages of this addressing technique and
list them below.
Advantage:
Disadvantage:
Reading/Activity: Having come thus far in this unit, please read pages 418-431 in the
prescribed text for this module and answer the questions under the following activity:
Activity
Briefly define:
Immediate Addressing
Direct Addressing
Indirect Addressing
Register Addressing
Register Indirect Addressing
Stack Addressing
Register versus Memory: A machine must have registers so that data can be brought into the
processor for processing. With a single user-visible register (called an accumulator), one
operand address is implicit and consumes no instruction bits. However single register
programming is awkward and requires many instructions. The more that registers can be
used for operand references, the fewer bits are needed. Most studies indicate that a total of
between 8 and 32 registers is desirable.
Number Of Register Sets: Most contemporary machines have one set of general-purpose
registers, with typically 32 or more registers in the set. These registers can be used to
store data, or to store addresses for displacement addressing.
Address Range: For addresses that reference memory, the number of addresses that can be
referenced is related to the number of address bits. Because this imposes severe limitations,
direct addressing is seldom used. With displacement addressing, the range is opened up to
the length of the address register. Even so, it is still convenient to allow for large
displacements from the register address, which requires a relatively large number of address
bits in the instruction.
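To make displacement addressing concrete, here is a minimal illustrative sketch (not from the prescribed text; the register name and every value below are hypothetical): the instruction supplies a displacement that is added to the contents of a register to form the effective address.

```python
# Illustrative sketch, not from the prescribed text: displacement
# addressing. The register file and all values below are made up.

registers = {"R1": 0x4000}    # a base register holding a memory address

def effective_address(base_reg: str, displacement: int) -> int:
    # effective address = (contents of base register) + displacement
    return registers[base_reg] + displacement

print(hex(effective_address("R1", 0x20)))   # 0x4020
```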
Reading/Activity: Please read pages 431-444 in the prescribed text for this module and
answer the questions under the following activity:
Activity
Answer the following:
What factors go into determining the use of the addressing bits of an instruction?
What are the advantages and disadvantages of using a variable-length instruction format?
What is the advantage of auto-indexing?
What is the difference between post-indexing and pre-indexing?
MIPS rate = Ic / (T × 10^6) = f / (CPI × 10^6)

where Ic is the instruction count for the program, T is the execution time in seconds, f is the clock frequency, and CPI is the average cycles per instruction.
3.3.2 Benchmarks
Measures such as the MIPS have proved inadequate in the evaluation of the performance of
processors. Because of differences in instruction sets, the instruction execution rate is not a valid
means of comparing the performance of different architectures. Consider the high-level language
code:
A = B + C    /* assume all quantities are in memory */
On a typical CISC machine, this statement can be compiled into one processor instruction:
Add mem(B), mem(C), mem(A)
On a typical RISC machine, the compilation will look something like the following:
Load mem(B), reg(1);
Load mem(C), reg(2);
Add reg(1), reg(2), reg(3);
Store reg(3), mem(A);
Because RISC instructions are simpler and execute faster, both machines may execute the
original high-level language statement in about the same time. This means that the CISC
machine may be rated at 1 MIPS while the RISC machine is rated at 4 MIPS, even though the
two machines do the same amount of work in the same amount of time.
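Plugging these figures into the MIPS formula makes the point concrete. The sketch below is illustrative only; the one-microsecond execution time is an assumption chosen so that both machines finish together.

```python
# Illustrative sketch, not from the prescribed text: MIPS = Ic / (T x 10^6).
# Assume (for illustration) both versions of A = B + C finish in T = 1 us.

def mips_rate(instruction_count: int, exec_time_s: float) -> float:
    return instruction_count / (exec_time_s * 1e6)

T = 1e-6                      # assumed execution time in seconds
print(mips_rate(1, T))        # CISC: one instruction   -> 1.0 MIPS
print(mips_rate(4, T))        # RISC: four instructions -> 4.0 MIPS
```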
The performance of a processor on a given program may not be a useful indication of how the
same processor will perform on a different application. Therefore, beginning in the late
1980s and early 1990s, industry and academic interest shifted towards measuring the
performance of systems using a set of
benchmark programs. The same set of programs can be run on different machines and the execution
times compared. A desirable benchmark program must have the following characteristics:
The ALU is the functional essence of the computer. Registers are used to hold data internal to the
processor and some registers contain status information needed to manage instruction sequencing.
Others contain data that go to or come from the ALU, memory and I/O modules. Internal data paths
are used for inter-register data movement and between registers and the ALU. External data paths
link registers to memory and I/O modules, often by means of a system bus. The CU therefore causes
operations to happen within the processor.
The execution of a program consists of operations involving these processor elements, all of
which are controlled and galvanized by the CU. These operations consist of sequences of
micro-operations, and all micro-operations fall into the following categories:
The transfer of data from one register to another.
The transfer of data from a register to an external interface.
The reverse route: the transfer of data from an external interface to a register.
Performing an arithmetic or logic operation, using registers for input and output.
All the micro-operations needed to perform one instruction cycle, including all the micro-operations
to execute every instruction in the instruction set, fall into one of these categories.
In a nutshell, the Control Unit performs two basic tasks:
Sequencing: The CU causes the processor to step through a series of micro-operations in the
proper sequence based on the program being processed.
Execution: The CU causes each micro-operation to be performed.
3.4.1 Control Signals
The key to how the CU operates is its use of signals. For the control unit to effectively perform its
function, it must have inputs that allow it to determine the state of the system and outputs that
allow it to control the behavior of the system. These are the external specifications of the CU.
Internally, the CU must have the logic required to perform its sequencing and execution functions.
Figure 3.1 is a general model showing the CU's inputs and outputs. The inputs are as follows:
Clock: This is how the CU keeps time. The CU causes one micro-operation, or a series of
micro-operations to be executed or performed for each clock pulse. This is sometimes
referred to as the processor cycle time or the clock cycle time.
Instruction Register: The opcode of the current instruction is used to determine which
micro-operations to perform during the execute cycle.
Flags: These are needed by the CU to determine the status of the processor and the
outcome of previous ALU operations.
Control Signals From Control Bus: The control bus portion of the system bus provides signals
to the CU, such as interrupt signals and acknowledgements.
Figure 3.1 Model Of The Control Unit Showing All Its Inputs And Outputs
There are three types of control signals that are used by the Control Unit:
There are those that activate an ALU function
Those that activate a data path, and
Those that are signals on the external system bus or other external interface. All these
signals are applied directly as binary inputs to individual logic gates.
Let us consider a fetch cycle to see how the control unit maintains its control over the function of
the system. The CU keeps track of where it is in the instruction cycle. At any given point, it knows
that the fetch cycle is to be performed next. The first step is to transfer the contents of the PC to the
MAR. The CU does this by activating a control signal that opens the gates between
the bits of the PC and the bits of the MAR. The next step is to read a word from memory into
the MBR and increment the PC. The CU does this by sending the following signals simultaneously:
A control signal that opens gates, allowing the contents of the MAR onto the address bus
A memory-read control signal onto control bus
A control signal that opens the gates, allowing the contents of the data bus to be stored in
the MBR
Control signals to logic that adds 1 to the contents of the PC and stores the results back to
the PC.
Following this, the CU sends a control signal that opens gates between the MBR and the IR. This
completes the fetch cycle except for one thing: the CU must decide whether to perform an indirect
cycle or an execute cycle next. To decide this, it examines the IR to see if an indirect memory
reference is made.
At the beginning of the fetch cycle, the address of the next instruction to be executed is in the PC.
The first step is to move that address to the MAR because this is the only register connected
to the address lines of the system bus.
The second step is to bring in the instruction. The desired address in the MAR is placed on
the address bus and the result appears on the data bus and is copied into the MBR. By now,
the PC needs to be incremented so that it points to the address of the next instruction that
comes after the current one under system scrutiny. These two actions, the reading of the
instruction into the MBR and the incrementing of the PC, are independent, so the system can
do them simultaneously.
The third step is to move the contents of the MBR to the IR. This frees up the MBR for use
during a possible indirect cycle.
This fetch cycle therefore consists of three steps and four micro-operations. Each micro-operation
involves the movement of data into or out of a register. These micro-operations can be symbolically
represented as follows:
t1: MAR ← (PC)
t2: MBR ← Memory
    PC ← (PC) + 1
t3: IR ← (MBR)
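As an illustrative aside (not from the prescribed text; the memory contents and addresses below are made up), the three time units above can be mimicked in a few lines of Python:

```python
# Illustrative sketch, not from the prescribed text: the fetch cycle's
# micro-operations applied to a register dictionary. Register names
# follow Figure 2.2; the program contents below are made up.

memory = {100: "ADD R1, 940", 101: "STA 941"}    # hypothetical program

regs = {"PC": 100, "MAR": None, "MBR": None, "IR": None}

def fetch_cycle(regs, memory):
    regs["MAR"] = regs["PC"]             # t1: MAR <- (PC)
    regs["MBR"] = memory[regs["MAR"]]    # t2: MBR <- Memory ...
    regs["PC"] = regs["PC"] + 1          #     ... PC <- (PC) + 1
    regs["IR"] = regs["MBR"]             # t3: IR <- (MBR)

fetch_cycle(regs, memory)
print(regs)   # PC is now 101 and IR holds "ADD R1, 940"
```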
As an example, consider the execute cycle for the instruction ADD R1, X, which adds the
contents of memory location X to register R1:

t1: MAR ← (IR(Address))
t2: MBR ← Memory[MAR]
t3: R1 ← (R1) + (MBR)
In the first step, the address portion of the IR is loaded into the MAR. Then the referenced
memory location is read into the MBR, and finally the contents of R1 and the MBR are added by the ALU.
We have come to the end of this unit and all that is left to do is to look at the Activity below and
complete it. Having read all the material above, you should complete this reading activity within 90
minutes.
Reading/Activity: Please read pages 579-602 in the prescribed text for this module and
answer the questions under the following activity:
Self-Assessment
1 Explain the difference between the written sequence and the time sequence of an instruction.
2 What is the relationship between instructions and micro-operations?
3 What is the overall function of a processor's CU?
4 Outline a 3-step process that leads to a characterization of the CU.
5 What basic tasks does the CU perform?
6 Provide a typical list of inputs and outputs of the CU.
7 List the three types of control signals.
Problems
3.0 Consider two different machines with two different instruction sets, both of which have a clock
rate of 200 MHz. The following measurements are recorded on the two machines running a
given set of benchmark programs:

Instruction Type | Machine A: Instruction Count (millions) | Machine A: CPI | Machine B: Instruction Count (millions) | Machine B: CPI
Arithmetic and logic | 8 | 1 | 10 | 1
Load and store | 4 | 3 | 8 | 2
Branch | 2 | 4 | 2 | 4
Others | 4 | 3 | 4 | 3
Determine the effective CPI, MIPS and execution time for each machine. Comment on the results.
3.1 While browsing at Billy Bob's computer store, you overhear a customer asking Billy Bob what
the fastest laptop computer in the store that he can buy is. Billy Bob replies, "You are looking at
our Acers. The fastest Acer we have has a Core 2 Quad processor that runs at a speed of 2.0 GHz.
If you really want the fastest machine, you should buy our 2.4 GHz Core 2 Duo Mac." Is Billy Bob
correct? What would you say to help this customer?
Relate to, understand and explain the main numbering systems of which the focus of this
unit is the binary numbering system
Understand and explain why the Binary numbering system is so cardinal to computing
Perform basic Addition, Subtraction, Multiplication and Division operations in Binary
Convert given numbers from Binary to Decimal and the reverse
Understand basic Boolean Algebra and Logic gates and the representations thereof
Understand the functions and importance of Assemblers and Compilers
The number system that we are all familiar with is the decimal number system, which we use
every day; it is anchored on, or expressed to, base 10. When we write an expression like
24 + 13 or 28 - 7, the numerals or collective digits involved are in decimal, or expressed to
base 10. There are other representations that can be used, the most common examples of which
are:
Octal: in which numbers are expressed to base 8, and
Hexadecimal: in which numbers are expressed to base 16.
However, in the field of computing the most important numbering system is the binary system,
which expresses or represents numbers to base 2. Because this is the cardinal numbering
system in the world of computing, it is the system on which we will concentrate and focus
our energies in this unit.
4.1 Why Binary?
Computers are built from transistors, and an individual transistor can only exhibit one of
two states at any given time: ON or OFF [Two Options]. Similarly, data storage devices can be optical or
magnetic. Optical storage devices store data in a specific location by controlling whether light is
reflected off that location or is not reflected off that location [Two Options]. Likewise, magnetic
storage devices store data in a specific location by magnetizing the particles in that location with a
specific orientation. We can have the north magnetic pole pointing in one direction, or the opposite
direction [Two Options].
Computers therefore can most readily use two symbols, and therefore a base-2 system, or a binary
number system, is most appropriate. The base-10 number system [Decimal] has 10 distinct symbols:
0, 1, 2, 3, 4, 5, 6, 7, 8 and 9. The base-2 system has exactly two symbols: 0 and 1. The base-10
symbols are termed digits. The base-2 symbols are termed binary digits, or bits for short. All base-10
numbers are built as strings of digits (such as 6349). All binary numbers are built as strings of bits
(such as 1101). Just as we would say that the decimal number 12890 has five digits, we would say
that the binary number 11001 is a five-bit number. The point: All data in a computer is represented
in binary and is read by the computer in bits.
4.2 Converting A Binary Number To Decimal
To convert a binary number to a decimal number, we simply write the binary number as a sum of
powers of 2. For example, to convert the binary number 1011 to a decimal number, we note that
the rightmost position is the ones position and the bit value in this position is a 1. So
this rightmost bit has the decimal value 1×2^0. The next position to the left is the twos
position, and the bit value in this position is also a 1; this bit has the decimal value
1×2^1. The next position to the left is the fours position, and the bit value in this
position is a 0. The leftmost position is the eights position, and the bit value in this
position is a 1, so this leftmost bit has the decimal value 1×2^3. Thus:

1011 = (1×2^3) + (0×2^2) + (1×2^1) + (1×2^0) = 11, i.e. 1011₂ = 11₁₀.
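Conversions like this can be checked with a couple of lines of Python (an illustrative aside, not part of the guide):

```python
# Illustrative aside, not from the guide: checking binary-to-decimal
# conversions in Python.

print(int("1011", 2))                              # 11
print(1 * 2**3 + 0 * 2**2 + 1 * 2**1 + 1 * 2**0)   # 11, the sum of powers
```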
Think Point: Look closely at the conversion demonstration below and following through the
steps shown, then make an attempt to convert the binary number given to decimal.
To convert the binary number 10101 to decimal, we annotate the position values below the bit
values:

 1   0   1   0   1
16   8   4   2   1

Then we add the position values for those positions that have a bit value of 1: 16 + 4 + 1 = 21. Thus
10101₂ = 21₁₀.
Convert the binary number 1100101101101 to decimal.
Answer:
In the fourth step, we subtract 2^3 = 8 from 10 and we get 2. What is the highest power of 2
which is equal to or less than 2? The answer is 2 itself, which is 2^1. We denote this
position as (1×2^1).
In the fifth step we subtract 2^1 from 2 and the answer is 0. At this juncture, the algorithm
ends. Now we look at the positions we have denoted above: (1×2^6), (1×2^4), (1×2^3) and
(1×2^1). This means that from position 2^6 down to position 2^0, only the positions denoted
here have the value 1; the rest each have the value 0. Now we list down the binary number obtained:
Position:  2^6   2^5   2^4   2^3   2^2   2^1   2^0
Bit:         1     0     1     1     0     1     0
Value:      64     0    16     8     0     2     0

The values sum to 90, so 90₁₀ = 1011010₂.
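Method 1 can also be expressed as a short program. The sketch below is illustrative only and is not from the prescribed text:

```python
# Illustrative sketch, not from the prescribed text: Method 1 as code.
# Repeatedly subtract the largest power of 2 that fits, writing a 1 in
# that position and a 0 everywhere else.

def to_binary_method1(x: int) -> str:
    if x == 0:
        return "0"
    top = x.bit_length() - 1           # highest power of 2 <= x
    bits = []
    for power in range(top, -1, -1):
        if x >= 2 ** power:
            bits.append("1")
            x -= 2 ** power
        else:
            bits.append("0")
    return "".join(bits)

print(to_binary_method1(90))   # 1011010
```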
4.3.2 Method 2
The second method of converting a decimal number to a binary number entails repeatedly dividing
the decimal number by 2, keeping track of the remainder at each step. To convert the decimal
number x to binary:
Step 1. Divide x by 2 to obtain a quotient and remainder. The remainder will either be 0 or 1.
Step 2. If the quotient is zero, you are done: Proceed to Step 3. Otherwise, go back to Step 1,
assigning x to be the value of the most-recent quotient from Step 1.
Step 3. The sequence of remainders forms the binary representation of the number writing the
remainders from last to the first.
Let us convert the decimal number 71 to binary using this method:

2 into 71 = 35, remainder 1
2 into 35 = 17, remainder 1
2 into 17 = 8, remainder 1
2 into 8 = 4, remainder 0
2 into 4 = 2, remainder 0
2 into 2 = 1, remainder 0
2 into 1 = 0, remainder 1

Now, taking up all the remainders from the bottom, 71₁₀ = 1000111₂.
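Method 2 translates directly into a loop. The following sketch is illustrative only and is not from the prescribed text:

```python
# Illustrative sketch, not from the prescribed text: Method 2 as a loop.
# Collect remainders from repeated division by 2, then read them from
# last to first.

def to_binary_method2(x: int) -> str:
    remainders = []
    while x > 0:
        remainders.append(x % 2)   # the remainder is the next bit
        x //= 2                    # the quotient becomes the new x
    return "".join(str(bit) for bit in reversed(remainders)) or "0"

print(to_binary_method2(71))   # 1000111
print(bin(71)[2:])             # the built-in agrees
```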
Reading/Activity: If you have access to the prescribed textbook's online resources, please
read online Chapter 19, which deals exclusively with Number Systems. At this stage read
sections 19-1 to 19-3. If you have no access to this resource, read these guide notes and
also, as much as possible, search the internet for related information. There is a tremendous
amount of relevant information on number systems alone. Try to go through the various
examples and questions given, and gauge your retention level.
Self-Assessment
1 Explain as briefly and comprehensibly as possible why the binary number system is the preferred
number system in computing.
2 Using Method 1, convert 210₁₀ to binary.
3 Using Method 2, convert 210₁₀ to binary and compare your answer with that of question 2.
4 Convert 101101101₂ to decimal.
So we wind up with a total of 1001000. This number is 72₁₀. We may now verify this addition
using decimal: the top number 101101 = 45₁₀ and the bottom number 11011 = 27₁₀, and adding
these numbers in decimal gives us 72₁₀.
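A quick illustrative check of this sum (not part of the guide):

```python
# Illustrative check, not from the guide: 101101 + 11011 in binary.
print(bin(0b101101 + 0b11011))   # 0b1001000, i.e. 45 + 27 = 72
```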
4.5 Binary Subtraction
We would like to use the same numbers, for convenience's sake, to work an example of binary
subtraction. Binary subtraction is quite straightforward, as we shall see. The subtraction of
the current two numbers follows.

4.5.1 Method 1: The Borrow Method

We will use the borrow method in situations where we have to subtract 1 from 0.
  101101    = 45₁₀
- 011011    = 27₁₀
  010010    = 18₁₀
Again, moving from right to left, the first column is simple because we are subtracting a
binary digit (bit) 1 from another 1, giving the answer 0. Moving on to the second column, we
are to subtract a 1 bit from a 0 bit. This means that we have to borrow from the next column
(in this case the third column). The 0 bit in column two becomes 10, and the 1 bit in column
three becomes 0. Going back to column two, we compute 10 minus 1 and get a 1 bit, which we
put down. Moving on to the third column, we know already that our 1 bit is now zero, since it
was borrowed by column two. So a 0 bit is subtracted from this 0 bit and the answer is zero,
which we put down. We move on to the fourth column: a 1 bit subtracted from a 1 bit yields
another 0 bit. Moving on to column five, we encounter that awkward scenario again, where a 1
bit is being subtracted from a 0 bit. Again we borrow the 1 bit from the sixth column, and
the 0 bit in the fifth column becomes 10. Now, from 10 minus 1 we get a 1 bit, which we put
down as usual. Going on to the sixth column, the 1 bit has been borrowed by the fifth column,
so we have a 0 bit here, not a 1 bit. So 0 minus 0 gives us a zero, which we may put down for
completeness' sake.
4.5.2 Method 2: Two's Complement

Instead of borrowing, we can add the two's complement of the subtrahend:

   101101    = 45₁₀
+  100101    = -27₁₀ (the two's complement of 011011)
(1)010010    = 18₁₀

In the result you may have observed an absurd 1 bit in parentheses. It is an overflow in our
calculation and may be ignored. The answer therefore is 010010, which is 18₁₀.
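The same calculation can be checked in code. The sketch below is illustrative only; it fixes a 6-bit width so that the overflow bit is dropped, just as we ignored the parenthesized 1 above.

```python
# Illustrative sketch, not from the prescribed text: two's-complement
# subtraction in a fixed 6-bit width. Negate by inverting and adding 1;
# masking to 6 bits drops the overflow carry (the parenthesized 1).

BITS = 6
MASK = 2 ** BITS - 1

a, b = 0b101101, 0b011011            # 45 and 27
neg_b = (~b + 1) & MASK              # two's complement of b: 0b100101
result = (a + neg_b) & MASK          # add, then discard the carry out
print(f"{result:06b} = {result}")    # 010010 = 18
```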
1. We now turn to binary multiplication: let us multiply 110₂ by 10₂ (that is, 6₁₀ by 2₁₀). We write the multiplier below the multiplicand.

      110
    x  10

2. We begin by multiplying 110₂ by the rightmost digit of our multiplier, which is 0. Any number times zero is zero, so we just write zeros below.

      110
    x  10
      000

3. Now we multiply the multiplicand by the next digit of our multiplier, which is 1. To perform this multiplication, we just need to copy the multiplicand and shift it one column to the left, as we do in decimal multiplication.

      110
    x  10
      000
     110

4. Finally, we add the partial products together.

      110
    x  10
      000
     110
     ----
     1100
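The copy-and-shift process described above is exactly the shift-and-add algorithm. A minimal Python sketch:

    def bin_multiply(a, b):
        """Multiply two binary strings using shift-and-add."""
        multiplicand = int(a, 2)
        result = 0
        for i, bit in enumerate(reversed(b)):
            if bit == "1":
                # Copy the multiplicand, shifted left by the column position.
                result += multiplicand << i
        return format(result, "b")

    print(bin_multiply("110", "10"))  # 1100 (6 x 2 = 12)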
Now let's look at a simple division problem in binary: 11₂ / 10₂, or 3₁₀ / 2₁₀. This time 10₂ is the divisor and 11₂ is the dividend. The steps below show how to find the quotient, which is 1.1₂.

1. First, we need to find the smallest part of our dividend that is greater than or equal to our divisor. Since our divisor has two digits, we start by checking the first two digits of the dividend.

       10 | 11

2. 11 is greater than 10, so we write a 1 in the quotient, copy the divisor below the dividend, and subtract using the borrow method.

             1
       10 | 11
           -10
             1

3. Since we have no more digits in our dividend but we still have a remainder, our answer must include a fraction. To finish our problem we need to mark the radix point and append a zero to the dividend.

             1.
       10 | 11.0
           -10
             1

4. Now we bring down the extra zero and write it beside our remainder. Then we check to see if this new number is greater than or equal to our divisor. Notice that we ignore the radix point in our comparison.

             1.
       10 | 11.0
           -10
             10

5. 10 equals the divisor 10, so we write a 1 in the quotient, copy the divisor below the dividend, and subtract. This completes our division, because we have no more digits in the dividend and no remainder.

             1.1
       10 | 11.0
           -10
             10
            -10
              0
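The same long-division steps can be expressed as a short program. Below is an illustrative Python sketch (the function name and the four-fractional-bit limit are choices made for this example, not taken from the text):

    def bin_divide(dividend, divisor, frac_bits=4):
        """Binary long division, with up to frac_bits digits after the radix point."""
        quotient, remainder = divmod(dividend, divisor)
        digits = format(quotient, "b") + "."
        for _ in range(frac_bits):
            if remainder == 0:
                break
            remainder <<= 1              # append a zero and bring it down
            if remainder >= divisor:
                digits += "1"
                remainder -= divisor
            else:
                digits += "0"
        return digits.rstrip(".")

    print(bin_divide(0b11, 0b10))  # 1.1 (3 / 2 = 1.5)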
Reading/Activity: If you have access to the prescribed textbook's online resources, please read online Chapter 19, which deals exclusively with Number Systems. At this stage read sections 19-1 to 19-3. If you have no access to this resource, read these guide notes and, as much as possible, search the internet for related information. There is a great deal of relevant information on number systems alone. Try to go through the various examples and questions given, and gauge your retention level.
Self-Assessment
1 Add 1100101 to 101011, showing your working as clearly as possible.
2 Perform this addition: 10010111 + 11100110.
3 Perform this subtraction, 11111110 - 10010111, using both the borrow method and 2's complement.
4 Perform the following subtraction, showing your working: 11100110 - 10010111, using both the borrow method and 2's complement.
5 What is 2's complement representation and why is it important?
6 Convert 100100111010 to its 2's complement.
DeMorgan's:
(A + B)' = A'.B'
(A.B)' = A' + B'
Associative:
(A.B).C = A.(B.C)
(A + B) + C = A + (B + C)
Commutative:
A.B = B.A
A + B = B + A
Distributive:
A.(B + C) = A.B + A.C
A + (B.C) = (A + B).(A + C)
Note: The OR operator is represented by the + symbol and has the lowest precedence. The NOT operator (written here as ', e.g. A' for NOT A; often printed as an overbar) has the highest precedence. The AND operator is represented by the . symbol and sits in the middle.
As an example, let us simplify a Boolean expression using some of the above rules. Thereafter we can look at how we can build logic circuits from these Boolean expressions.
The expression that we need to simplify is F = ABC'D + AB'C'D + BCD + AB'CD + BCD'.
Using the above rules we can simplify the expression as follows:
F = ABC'D + AB'C'D + BCD + AB'CD + BCD'
  = ABC'D + AB'C'D + BCD + BCD' + AB'CD
  = AC'D(B + B') + BC(D + D') + AB'CD    [X + X' = 1, since 0 + 0' = 0 + 1 = 1 and 1 + 1' = 1 + 0 = 1]
  = AC'D(1) + BC(1) + AB'CD
  = AC'D + BC + AB'CD
  = AC'D + AB'CD + BC
  = AD(C' + B'C) + BC
  = AD((C' + B')(C' + C)) + BC    [Distributive Law]
  = AD((C' + B')(1)) + BC
F = AD(BC)' + BC    [DeMorgan's: C' + B' = (BC)']
You can solve any similar problem the very same way.
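A handy way to check a simplification like this one is to compare truth tables. The following Python sketch (illustrative only; it encodes the expressions above with `not` standing in for the prime) confirms that the original and simplified forms agree for all 16 input combinations:

    from itertools import product

    def original(a, b, c, d):
        # F = ABC'D + AB'C'D + BCD + AB'CD + BCD'
        return ((a and b and not c and d) or (a and not b and not c and d)
                or (b and c and d) or (a and not b and c and d)
                or (b and c and not d))

    def simplified(a, b, c, d):
        # F = AD(BC)' + BC
        return (a and d and not (b and c)) or (b and c)

    assert all(original(*v) == simplified(*v)
               for v in product([False, True], repeat=4))
    print("equivalent over all 16 input combinations")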
[Figure: four logic circuit designs]
(a) F = a + b + c
(b) F = a + b'
(c) F = b + ab
(d) F = (a' + b).(a + b)
The two designs in (a) are equivalent, but the lower design is more expensive because it contains more logic gates and circuitry. The upper design is therefore more desirable, because it is minimal and achieves the same goal as the lower one. We move on to (b) and (c). (b) is a simple OR circuit in which one of the inputs is inverted, while (c) is an example of a Sum Of Products (SOP): the products (ab and b) are summed (+). Finally we move on to (d), which is an example of a Product Of Sums (POS), in which the sums involving input a and input b are ANDed.
Reading/Activity: If you have access to the prescribed textbook's online resources, please read online Chapter 20, which deals exclusively with Digital Logic. At this stage read sections 20-1 to 20-5. If you have no access to this resource, read these guide notes and, as much as possible, search the internet for related information.
You are encouraged to do a little research on such topics as Sum Of Products (SOP), Product Of Sums (POS), Truth Tables and the use of Truth Tables.
Self-Assessment
1 Divide 00011 into 10010, showing your working as clearly as possible.
2 Perform this multiplication: 1010 x 1100.
3 Reduce the expression F = b + ab'c + abc to its simplest form.
4 Draw the resultant logic circuit.
program, as well as doing symbolic translations, and also performs some form of memory address allocation for symbolic addresses. The development of assembly language was a major milestone in the evolution of computer technology, and it was the first step towards the high-level languages that are in use today. Although few programmers use assembly language today, virtually all machines provide one. Assembly language interacts with systems programs such as compilers and I/O routines.
Depict, represent and explain the general computer memory-cache hierarchical model
Relate to and understand cache architecture and the different ways in which cache is
implemented in relation to main memory
Demonstrate an understanding of cache memory organization
Demonstrate a good understanding of what internal memory is and identify the types of internal memory that are commonly used
Be conversant with external memory, the types thereof and the different types and levels of
redundancy that can be implemented on them
Obtain a broad understanding of I/O and comprehensively explain the three main
techniques used for I/O implementation and organization
Cache is a small, high-speed memory, usually Static RAM (SRAM), that contains the most recently accessed pieces of main memory. Why is this high-speed memory necessary or beneficial? In today's systems, the time it takes to bring an instruction (or piece of data) into the processor is very long when compared to the time needed to execute the instruction. For example, a typical access time for DRAM is 60 ns. A 100 MHz processor can execute most instructions in 1 CLK, or 10 ns. Therefore a bottleneck forms at the input to the processor.
Cache memory helps by decreasing the time it takes to move information to and from the processor. A typical access time for SRAM is 15 ns. Therefore cache memory allows small portions of main memory to be accessed 3 to 4 times faster than DRAM (main memory). How can such a small piece of high-speed memory improve system performance?
The theory that explains this performance is called Locality of Reference. The concept is that at any given time the processor will be accessing memory in a small or localized region of memory. The cache loads this region, allowing the processor to access the memory region faster. How well does this work?
In a typical application, the internal 16K-byte cache of a Pentium processor contains over 90% of the addresses requested by the processor. This means that over 90% of memory accesses occur out of the high-speed cache. So now the question: why not replace main memory DRAM with SRAM? The main reason is cost. SRAM is several times more expensive than DRAM. SRAM also consumes more power and is less dense than DRAM. Now that the reason for cache has been established, let us look at a simplified model of a cache system.
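The benefit of the cache can be estimated from the figures quoted above. The short Python calculation below is a simplified model (it assumes a miss costs one DRAM access and uses the 90% hit rate as given):

    hit_rate = 0.90           # over 90% of accesses hit the internal cache
    t_cache, t_dram = 15, 60  # access times in nanoseconds

    # Hits are served from the cache; misses are served from DRAM.
    average = hit_rate * t_cache + (1 - hit_rate) * t_dram
    print(f"{average:.1f} ns average vs {t_dram} ns with no cache")  # 19.5 ns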
62 | P a g e
[Figure 5-1: A simplified cache system model — the CPU talks to the cache memory, which connects through the bus and system interface to main DRAM memory.]
Snoop: When a cache is watching the address lines for a transaction, this is called a snoop. This function allows the cache to see if any transactions are accessing memory it contains within itself.
Snarf: When a cache takes the information from the data lines, the cache is said to have snarfed the data. This function allows the cache to be updated, thereby maintaining consistency.
Snoop and snarf are the mechanisms the cache uses to maintain consistency. Two other terms are commonly used to describe inconsistencies in the cache data; these terms are:
Dirty Data: When data is modified within the cache but not in main memory, the data in the cache is called dirty data.
Stale Data: When data is modified within main memory but not in the cache, the data in the cache is called stale data.
5.2 Cache Architecture
Caches have two characteristics: [1] a read architecture and [2] a write policy. The read architecture may be either Look-Aside or Look-Through. The write policy may be either Write-Back or Write-Through. Both types of read architecture may have either type of write policy, depending on the design. Write policies will be described in more detail in the next section. Let us examine the read architecture now.
[Figure 5-2: Look-Aside cache read architecture — the cache (SRAM, Tag RAM and cache controller) sits on the system bus in parallel with main memory.]
Look-Aside Cache Read Architecture: Figure 5-2 shows a simple diagram of the Look-Aside cache architecture. In this diagram, main memory is located opposite the system interface. The distinguishing feature of this cache unit is that it sits in parallel with main memory. It is important to notice that both the main memory and the cache see a bus cycle at the same time, hence the name look aside.
When the processor starts a read cycle, the cache checks to see if that address is a cache hit. If the cache contains the memory location (a hit), the cache responds to the read cycle and terminates the bus cycle. If the cache does not contain the memory location (a miss), main memory responds to the processor and terminates the bus cycle. The cache will snarf the data, so the next time the processor requests this data it will be a cache hit.
Look-Aside caches are less complex, which makes them less expensive. This architecture also provides a better response to a cache miss, since both the DRAM and the cache see the bus cycle at the same time. The drawback is that the processor cannot access the cache while another bus master is accessing main memory.
Look-Through Cache Read Architecture: Figure 5-3 shows a simple diagram of this cache architecture. Again, main memory is located opposite the system interface. The distinguishing feature of this cache unit is that it sits between the processor and main memory. It is important to notice that the cache sees the processor's bus cycle before allowing it to pass on to the system bus. When the processor starts a memory access, the cache checks to see if that address is a cache hit. If it is a hit, the cache responds to the processor's request without starting an access to main memory. If it is a miss, the cache passes the bus cycle on to the system bus. Main memory then responds to the processor's request. The cache snarfs the data, so that next time the processor requests this data, it will be a cache hit.
This architecture allows the processor to run out of the cache while another bus master is accessing main memory, since the processor is isolated from the rest of the system. However, this cache architecture is more complex, because it must be able to control accesses to the rest of the system. The increase in complexity increases the cost. Another downside is that memory accesses on cache misses are slower, because main memory is not accessed until after the cache is checked. This is not an issue if the cache has a high hit rate and there are other bus masters. Figure 5-3 shows a depiction of the Look-Through cache read architecture.
[Figure 5-3: Look-Through cache read architecture — the cache (SRAM, Tag RAM and cache controller) sits between the CPU and the system interface, in front of main memory.]
SRAM: Static Random Access Memory (SRAM) is the memory block which holds the data. The size of the SRAM determines the size of the cache.
Tag RAM: Tag RAM (TRAM) is a small piece of SRAM that stores the addresses of the data that is stored in the SRAM.
Cache Controller: The cache controller is the brains behind the cache. Its responsibilities include performing the snoops and snarfs, updating the SRAM and TRAM, and implementing the write policy. The cache controller is also responsible for determining if a memory request is cacheable and if a request is a cache hit or miss.
[Figure: cache organization — a cache page is made up of a number of cache lines, and the cache as a whole is divided into a number of such pages.]
[Figure: lines of main memory (Line 0 through Line m) and the corresponding lines of the cache (Line 0 through Line m).]
[Figure 5-6: L1 and L2 caches — the CPU's L1 cache memory is backed by an L2 cache memory, which connects through the system interface to main DRAM memory.]
When developing a system with a Pentium processor, it is common to add an external cache. The external cache is the second cache in a Pentium processor system, and it is therefore called a Level 2 (or L2) cache. The internal processor cache is referred to as a Level 1 (or L1) cache. The names L1 and L2 do not depend on where the cache is physically located (i.e., internal or external). Rather, they depend on what is first accessed by the processor (i.e., the L1 cache is accessed before the L2 cache whenever a memory request is generated). Figure 5-6 shows how the L1 and L2 caches relate to each other in a Pentium processor system.
5.10 Pentium Cache Organization
Both caches are 2-way set-associative in structure. The cache line size is 32 bytes, or 256 bits. A cache line is filled by a burst of four reads on the processor's 64-bit data bus. Each cache way contains 128 cache lines. The cache page size is 4K, or 128 lines.
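Given the stated geometry (32-byte lines, 128 lines per way), a memory address naturally splits into tag, set-index and byte-offset fields. The Python sketch below illustrates that split; the field widths follow from the geometry above, while the function name and example address are our own:

    LINE_SIZE = 32  # bytes per cache line -> 5 offset bits
    SETS = 128      # lines per way        -> 7 set-index bits

    def split_address(addr):
        """Split an address into (tag, set index, byte offset)."""
        offset = addr % LINE_SIZE
        set_index = (addr // LINE_SIZE) % SETS
        tag = addr // (LINE_SIZE * SETS)
        return tag, set_index, offset

    tag, idx, off = split_address(0x12345678)
    print(hex(tag), idx, off)  # 0x12345 51 24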
5.11 Operating Modes
Unlike the cache systems discussed in the overview of cache above, the write policy on the Pentium processor allows software to control how the cache will function. The bits that control the cache are the CD (Cache Disable) and NW (Not Write-Through) bits. As the name suggests, the CD bit allows the user to disable the Pentium processor's internal cache: when CD = 1, the cache is disabled; when CD = 0, the cache is enabled. The NW bit allows the cache to be either write-through (NW = 0) or write-back (NW = 1).
5.12 Cache Consistency
The Pentium processor maintains cache consistency with the MESI (Modified, Exclusive, Shared, Invalid) protocol. MESI is used to allow the cache to decide if a memory entry should be updated or invalidated. With the Pentium processor, two functions are performed to allow its internal cache to stay consistent: snoop cycles and cache flushing. The Pentium processor snoops during memory transactions on the system bus. That is, when another bus master performs a write, the Pentium processor snoops the address. If the Pentium processor contains the data, the processor will schedule a write-back. Cache flushing is the mechanism by which the Pentium processor clears its cache. A cache flush may result from actions in either hardware or software. During a cache flush, the Pentium processor writes back all modified (or dirty) data. It then invalidates its cache (i.e., makes all cache lines unavailable). After the Pentium processor finishes its write-backs, it generates a special bus cycle called the Flush Acknowledge Cycle. This signal allows lower-level caches, e.g. L2 caches, to flush their contents as well.
Self-Assessment
1 What are the differences among sequential access, direct access and random access?
2 What is the general relationship among access time, memory cost and capacity?
3 How does the principle of locality relate to the use of multiple memory levels?
4 What are the differences among direct mapping, associative mapping and set associative
mapping?
5 For a direct mapped cache, a main memory address is viewed as consisting of three fields. List and
define the fields.
RAM technology is divided into two families: dynamic and static. A dynamic RAM (DRAM) is made up of cells that store data as charge on capacitors. The presence or absence of charge on a capacitor is interpreted as a binary 1 or 0 respectively. Because capacitors have a tendency to discharge, dynamic RAMs require periodic charge refreshing, even with power continuously supplied, to maintain storage of the data. Although a DRAM is used to store a digital value, it is essentially an analog device.
In contrast, a static RAM [SRAM] is a digital device that uses the same logic elements that are used in the processor. In a SRAM, bits are stored using traditional flip-flop logic-gate configurations. A static RAM will hold its data as long as power is supplied to it.
Both SRAM and DRAM are volatile, but a SRAM cell is more complicated and larger than a DRAM cell. This means that a DRAM is denser and less expensive than a corresponding SRAM. On the other hand, a DRAM requires supporting refresh circuitry.
5.15 Types Of ROM
As the name suggests, Read Only Memory [ROM] contains a permanent pattern of data that cannot be changed. A ROM is nonvolatile, meaning that no power source is necessary to maintain the bit values that are in memory. While it is possible to read a ROM, it is not possible to write new data into it. An important application of ROM is microprogramming. Other potential applications include:
Library subroutines for frequently wanted functions
System programs
Function tables
For a modest-size requirement, the advantage of ROM is that the data or program is permanently in main memory and need never be loaded from an external device. The ROM is created like any other integrated circuit chip, with the data actually wired into the chip as part of the fabrication process. This presents two problems:
The data insertion step includes a relatively large fixed cost, whether one or thousands of copies of a particular ROM are fabricated.
There is no room for error. If one bit is wrong, the whole batch of ROMs must be thrown out.
When only a small number of ROMs with a particular memory content is required, a less expensive alternative is the programmable ROM [PROM]. Like the ROM, the PROM is nonvolatile and may be written into only once. For the PROM, the writing process is performed electrically and may be performed by a supplier or customer at a time later than the original chip fabrication. Special equipment, however, is required for the writing or programming process. PROMs provide flexibility and convenience, but ROMs remain attractive for high-volume production runs.
Another variation of ROM is the read-mostly memory, which is useful for applications in which read operations are far more frequent than write operations but for which nonvolatile storage is required. There are three common types of read-mostly memory: EPROM, EEPROM and flash memory.
The optically erasable PROM [EPROM] is read and written electrically, as with the PROM. However, before a write operation, all the storage cells must be erased to the same initial state by exposing the packaged chip to ultraviolet radiation. Erasure is performed by shining an intense ultraviolet light through a window that is designed into the memory chip. This erasure process can be performed repeatedly, and each erasure can take up to 20 minutes to complete. Thus, the EPROM can be altered multiple times and, like the ROM and PROM, it holds its data indefinitely. For comparable amounts of data, the EPROM is more expensive than the PROM, but it has the advantage of multiple update capability.
A more attractive form of read-mostly memory is Electrically Erasable Programmable Read Only Memory [EEPROM]. This is a read-mostly memory that can be written into at any time without having to erase prior contents; only the byte or bytes addressed are updated. The EEPROM is therefore nonvolatile and flexibly updatable.
The last form of semiconductor memory is flash memory, so named because of the speed with which it can be reprogrammed. Like EEPROM, flash memory uses electrical erasing technology. An entire flash memory can be erased in one to a few seconds, which is much faster than EPROM. It is also possible to erase just a few blocks of memory rather than an entire chip. Flash memory gets its name because the microchip is organized in such a manner that it is possible to erase a section of memory cells in a single action, or flash. Flash memory, however, does not provide byte-level erasure.
Self-Assessment
1 What are the key properties of semiconductor memory?
2 What are the two senses in which the term Random Access Memory is used?
3 What is the difference between DRAM and SRAM in terms of application?
4 What is the difference between DRAM and SRAM in terms of characteristics such as speed, size and cost?
5 List some applications of ROM.
6 Explain why one type of RAM is considered to be analog and the other digital.
7 Explain the differences between EPROM, EEPROM and flash memory.
8 How does SDRAM differ from ordinary DRAM?
about a byte each. The error correction code is computed across all disks and stored on (a) separate disk(s). This implementation uses fewer disks than RAID 1, but it is still expensive.
RAID 3: This implementation is much like RAID 2 in its characteristics, but only a single redundant disk is used. This disk is called the parity drive. A parity bit is computed for the full set of individual bits in the same position on all disks. If a drive fails, the parity information on the redundant disk can be used to reconstruct the data from the failed disk on the fly.
Self-Assessment
1 Define the terms track, cylinder and sector.
2 What common characteristics are shared by all RAID levels?
3 What is the typical disk sector size?
4 How is data read from a magnetic disk?
5 Briefly define the seven RAID levels.
From The I/O Module's Point of View: The module receives a READ command from the processor, and the I/O module reads the data from the desired peripheral into its data register. After this, the I/O module interrupts the processor to notify it that it has completed the task assigned to it by the processor. The I/O module then waits for the processor to request the data; when that request arrives, the I/O module places the data on the data bus.
From The Processor's Point of View: The processor issues a READ command and continues immediately with other useful work. The processor then keeps track of the instruction cycle, and at the end of each cycle it checks for interrupts. When it finds the relevant interrupt, it saves the current context. The processor then reads the data from the I/O module and writes it to memory. Finally, the processor restores the saved context and resumes execution.
Design Issues: How does the processor determine which device issued the interrupt? How are multiple interrupts dealt with? There are several ways of identifying the interrupting device:
o Software Poll: The processor polls each I/O module with a separate command line to test if it is the interrupting I/O module. The processor does this by reading the status register of each I/O module. This technique is time-consuming.
o Daisy Chain (Hardware Poll): This technique uses a common interrupt request line for all I/O modules. The processor sends an interrupt acknowledgement. The requesting I/O module places a word of data on the data lines; this data word is called a vector, and its job is to uniquely identify the I/O module. This kind of interrupt is called a vectored interrupt.
o Bus Arbitration: The I/O module first gains control of the bus and then sends an interrupt request to the processor. The processor acknowledges the interrupt request, and the I/O module places its vector on the data lines.
Multiple Interrupts: The techniques above not only identify the requesting I/O module but also provide methods of assigning priority. Where there are multiple interrupting lines, the processor identifies the line with the highest priority and services that line.
o Multiple Lines: the line with the highest priority may be the one that has just placed a vector containing the data that is needed in the current or next processor execution cycle.
o Software Polling: the polling order determines priority.
o Daisy Chain: the daisy-chain order of the modules determines priority.
o Bus Arbitration: the arbitration scheme determines priority.
DMA Operation: The processor issues a READ or WRITE command to the DMA module, passing the I/O device address on the data lines. The starting [first] memory address is sent on the data lines and stored in the DMA module's address register. The number of words to be transferred is also sent on the data lines and stored in the data register. After this, the processor continues with other work. The DMA module transfers the entire block of data, one word at a time, directly to or from memory without going through the processor, and when the transfer is complete, the DMA module sends an interrupt to the processor. For each word, the DMA module suspends the processor just before the processor needs the bus, transfers the word, and then returns control to the processor. Since this is not an interrupt, the processor does not have to save context. The processor executes more slowly, but this is still far more efficient than either programmed or interrupt-driven I/O.
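To make the bookkeeping concrete, here is a toy Python model of the sequence just described (all class and parameter names are illustrative inventions, not a real DMA interface):

    class DMAModule:
        """Toy model of the register set-up and block transfer described above."""
        def __init__(self, memory):
            self.memory = memory

        def transfer(self, device_data, start_address, word_count):
            # The address register holds the starting memory address;
            # the data register holds the number of words to move.
            for i in range(word_count):
                # Each word goes directly to memory, bypassing the processor.
                self.memory[start_address + i] = device_data[i]
            return "interrupt"  # notify the processor on completion

    memory = [0] * 16
    dma = DMAModule(memory)
    print(dma.transfer([7, 8, 9], start_address=4, word_count=3))  # interrupt
    print(memory[3:8])  # [0, 7, 8, 9, 0]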
The DMA architecture can be implemented using a single-bus configuration, where this single bus is the system bus. Another variation is the use of a special I/O bus, a second bus in addition to the main system bus:
Single Bus, Detached DMA Module: in this configuration, each transfer uses the bus twice (I/O to DMA and DMA to memory), and the processor is suspended twice.
Single Bus, Integrated DMA Module: in this configuration more than one I/O device may be supported, each transfer uses the bus once (DMA to memory), and the processor is suspended just once.
Separate I/O Bus: the I/O bus supports all DMA-enabled devices; each transfer uses the bus once (DMA to memory) and the processor is suspended once.
Self-Assessment
1 List three broad classifications of external or peripheral devices.
2 What is the International Reference Alphabet?
3 What are the major functions of an I/O module?
4 List and briefly define three techniques for performing I/O.
5 Explain the difference between memory-mapped and isolated I/O.
6 When a device interrupt occurs, how does the processor determine which device issued the interrupt?
7 When a DMA module takes control of a bus, and while it retains control of the bus, what does the processor do?
Explain the meaning of pipelining and relate to how the concept of pipelining takes
advantage of the multi-stage nature of the basic instruction cycle to maximize resource
utilization
Understand the constraints that undermine the effectiveness of pipelining and how they are
treated to avoid performance drawbacks
Understand pipeline performance and the limitations thereof
Understand, explain, compare and contrast the characteristics, advantages and limitations
of CISC and RISC architectures
Understand Instruction-Level Parallelism and Superscalar Processors
An instruction cycle has a number of stages, and it is precisely this fact that makes it possible for the pipelining strategy to be profitable in the processing of instructions. In a factory, products at different stages of production can be worked on simultaneously by laying out the production process in an assembly line. The assembly-line philosophy can be used to effectively explain how pipelining works.
As a simple example, consider an instruction that has only two stages: fetch instruction and execute instruction. There are times during the execute stage when main memory is not being accessed. This time could be used to fetch the next instruction in parallel with the execution of the current one. The pipeline thus has two independent stages. The first stage fetches an instruction and buffers it. When the second stage is free, the first stage passes it the buffered instruction. While the second stage executes the instruction, the first stage takes advantage of the fact that memory is not being used, fetching and buffering the next instruction. This is called instruction prefetch, or fetch overlap. In general, pipelining requires a number of registers to store data between stages. Figure 6.1 depicts a two-stage instruction pipeline.
83 | P a g e
At this stage it becomes necessary for us to decompose the instruction cycle as follows:
Fetch Instruction (FI): Read the next expected instruction into the buffer
Decode Instruction (DI): Determine the opcode and the operand specifiers
Calculate Operands (CO): Calculate the address of each source operand. This may
involve displacement, register indirect, indirect or other forms of address
calculations
Fetch Operands (FO): Fetch each operand from memory. Operands already in registers need not be fetched.
Execute Instruction (EI): Perform the indicated instruction and store the result, if
any, in the specified destination operand location.
Write Operand (WO): Store the result in memory.
If we assume that each of the above stages takes the same amount of time, it can be shown that a six-stage pipeline can reduce the execution time for nine instructions from 54 time units to 14 time units.
Figure 6.2: Timing Diagram For A Six-Stage Instruction Pipeline Operation Involving 9 Instructions.
The diagram embodies the assumptions that all stages can be performed in parallel and that there are no memory conflicts. In this environment, the processor uses pipelining to speed up execution by breaking the instruction cycle into a number of separate stages that operate in sequence. However, the occurrence of branches and dependences between instructions complicates the design and implementation of pipelining.
6.1 Pipeline Performance And Limitations
A good design goal for any system is to have all its components performing useful work at all times, so that high efficiency is obtained. In this section we develop some simple measures of pipeline performance and relative speedup. The cycle time T of a pipeline is the time required to advance a set of instructions one stage through the pipeline. The cycle time can be determined as:
T = max[Ti] + d = Tm + d,   for 1 <= i <= k
Where:
Ti = the time delay of the circuitry in the ith stage of the pipeline
Tm = the maximum delay of all the i stages of the pipeline; this value is the highest Ti value
k = the number of stages in the instruction pipeline
d = the time delay of a latch; that is, the delay incurred in advancing data and signals from one stage to the next
In general, the time delay d is equivalent to a clock pulse and Tm >> d. Now suppose that n instructions are processed, with no branches. Let Tk,n be the total time required for a pipeline with k stages to execute n instructions. Then:
Tk,n = [k + (n - 1)]T
A total of k cycles is required to execute the first instruction, and the remaining n - 1 instructions require n - 1 further cycles.
Now consider a processor with equivalent functions but no pipeline, and assume that its instruction cycle time is kT. The speedup factor for the instruction pipeline, compared to execution without the pipeline, is defined as:
Sk = T1,n / Tk,n = nkT / [k + (n - 1)]T = nk / [k + (n - 1)]
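These formulas are straightforward to evaluate. A quick Python check for the nine-instruction, six-stage example given earlier:

    k, n, T = 6, 9, 1  # stages, instructions, cycle time (arbitrary units)

    time_pipelined = (k + (n - 1)) * T   # Tk,n = [k + (n-1)]T = 14
    time_serial = n * k * T              # T1,n = nkT = 54
    print(time_serial / time_pipelined)  # speedup Sk, about 3.86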
One factor that can limit performance is a pipeline hazard. A pipeline hazard occurs when the pipeline, or some portion of the pipeline, must stall because conditions do not permit continued execution. A stall is also referred to as a pipeline bubble. There are three types of hazards: resource, data and control.
Resource Hazard: This type of hazard occurs when two or more instructions that are already in the pipeline require the same resource. The result is that the instructions must be executed in series (serially, that is, one at a time or one after the other) rather than in parallel for a portion of the pipeline. A resource hazard is sometimes referred to as a structural hazard.
Data Hazard: A data hazard occurs when there is a conflict in the access of an
operand location. Consider two instructions of a program that are to be executed in
sequence and both access a particular memory or register operand. If the two
instructions are executed in strict sequence, no problem occurs. However if the
instructions are executed in a pipeline, it is possible that the operand value may be
updated in such a way that it will produce a different result than would occur if the
instructions were being executed sequentially. In other words, the program
produces an incorrect result because of the use of a pipeline.
Control Hazard: A control hazard, also known as a branch hazard, occurs when the pipeline makes the wrong decision on a branch prediction and therefore brings instructions into the pipeline that must subsequently be discarded.
Think Point: It is important to note that a pipeline executes an instruction in stages, and these stages normally reflect the steps needed to complete the execution of a single instruction. A pipeline moves multiple instructions through the various stages of execution at the same time, so that several instructions are in progress at once, each in a different stage. This overlap is what enhances system performance.
Reading/Activity: Please read pages 462-479 in your prescribed book to consolidate what
you already know up to this stage about Pipelining.
86 | P a g e
87 | P a g e
On Operands:
o Another study showed that each instruction [DEC-10, in this case] references 0.5 operands in memory and 1.4 in registers.
o Implications:
Need for fast operand accessing
Need for optimized mechanisms for storing and accessing local scalar variables
Execution Sequencing:
o Subroutine calls are the most time-consuming operations in high-level languages.
o There is an identified need to minimize their impact by:
Streamlining parameter passing
Providing efficient access to local variables
Supporting nested subroutine invocation
o Statistics:
98% of dynamically called procedures passed fewer than six parameters
92% used fewer than six local scalar variables
It is rare to have a long sequence of subroutine calls followed by returns [e.g. a recursive sorting algorithm]
Typically, the depth of nesting was very low.
Implications: The results above suggest that reducing the semantic gap through the use of complex architectures may not be the most efficient use of system hardware. The need also exists to optimize machine design for the most time-consuming tasks of typical high-level language programs. The findings of these studies make apparent the case for using large numbers of registers, to reduce memory references by keeping variables close to the CPU. The findings also argue for streamlining instruction sets rather than making them more complex.
88 | P a g e
Naively adding registers will not effectively reduce the need to access memory. Since most operand references are to local scalars, it is better to store them in registers, with perhaps a few global variables. The problem with this approach is that during program execution the definition of "local" changes with each procedure or method call and return. It is therefore better to use multiple small sets of registers, each assigned to a different procedure. Global variables could simply use memory, but this would be inefficient for frequently used globals. To deal with this, it is expedient to incorporate a set of global registers in the CPU. In this scheme, the registers available to a procedure are split: some are the global registers and the rest are in the current window.
6.6 A Look At CISC
There are arguments that Complex Instruction Set Computers [CISCs] are the more effective and efficient option for getting the best out of a computer system. Advocates of this position argue that CISC tends to offer richer instruction sets: it accommodates a larger number of instructions, as well as more complex instructions for handling complex algorithms such as those involved in scientific and actuarial work. Because CISC instructions are closer to high-level language constructs, CISC simplifies compilers and can improve the performance of the system. CISC has the effect of minimizing code size, which in the end contributes to a reduced instruction execution count. But because the instruction execution count is low, pipelining as a processing technique is less effective with CISC, because pipelining produces more efficient results as the number of instructions increases.
Under CISC, programs are smaller, meaning that they may execute faster: smaller programs have fewer instructions, requiring less instruction fetching, and they save memory in the process. In paged environments, smaller programs occupy fewer pages, resulting in fewer page faults.
Though CISC programs may be shorter, more bits are used for each instruction, so the total memory used may not necessarily be smaller. This is because opcodes require more bits, and operands also require more bits, because they are usually memory addresses [in the instruction structure], as opposed to the register identifiers that are the usual case for RISC.
The CISC architecture incorporates a more complex Control Unit to accommodate seldom-used complex operations, and because of this overhead, the more often used simple operations take longer. The speedup for complex instructions may be mostly due to their implementation as simpler instructions in microcode, which is similar in speed to simple instructions in RISC [except that the CISC designer must decide a priori which instructions to speed up this way].
6.7 Characteristics Of RISC Architectures
Generally, the RISC architecture aims at the processing of one instruction per cycle. A machine cycle is defined by the time it takes to fetch two operands from registers, perform an ALU operation and store the result in a register. RISC machine operations should be no more complicated than, and should execute about as fast as, microinstructions on a CISC machine. No microcoding is needed; the simple instructions will execute faster than their CISC equivalents, because there is no need to access a microprogram control store. Virtually all machine operations in RISC are register-to-register operations. Only
simple LOAD and STORE operations access memory, and this simplifies the instruction set and Control Unit design. This kind of design therefore encourages the optimization of register use to be built into the architecture.
The RISC architecture uses simple addressing modes. Almost all instructions use simple register addressing, although a few other addressing modes, such as displacement and PC-relative, may be provided. More complex addressing modes are synthesized in software from the simpler ones, and this kind of design further simplifies the instruction set as well as the Control Unit.
RISC also uses only a few instruction formats, which has the effect of building simplicity into the Control Unit. Instruction length is fixed and aligned on word boundaries. This optimizes instruction fetching, since single instructions do not cross page boundaries. RISC enables simultaneous opcode decoding and register operand access, since the field locations [especially the opcode] are fixed.
The benefits of RISC design are that compilers can optimize code effectively and that the Control Unit is simplified overall; a simpler CU can execute instructions much faster than a comparable CISC unit. Instruction pipelining can also be applied more effectively with a reduced instruction set.
6.8 RISC Pipelining
The simplified structure of RISC instructions allows us to reconsider pipelining. Most instructions are register-to-register, meaning that an instruction has two phases: [1] Fetch Instruction (FI) and [2] Execute Instruction (EI). The EI phase is an ALU operation with register input and output. For LOAD and STORE operations, three stages are needed: [1] FI, [2] EI and [3] M(emory). Because the EI phase usually involves an ALU operation, it may be longer than the other phases, and in this case we can divide it into two sub-phases: [1] EI1, the register file read, and [2] EI2, the ALU operation and register write.
6.9 Optimization Of The Pipeline
Delayed Branch: We have seen that data and branch dependencies reduce the overall execution rate in the pipeline. Delayed branch makes use of a branch that does not take effect until after the execution of the following instruction. Note that the branch takes effect during the EI phase of this following instruction, so the instruction location immediately following the branch is called the delay slot. This is because the instruction-fetching order is not affected by the branch until the instruction after the delay slot. Rather than wasting the delay slot on a NOOP, it may be possible to move the instruction preceding the branch into the delay slot while still retaining the program's semantics.
Conditional Branches: If the instruction immediately preceding the branch cannot alter the branch condition, this optimization can be applied; otherwise a NOOP is still required.
Delayed Load: On load instructions, the register to be loaded is locked by the processor. The processor continues execution of the instruction stream until it reaches an instruction needing the locked register. It then idles until the load is complete.
Reading/Activity: Please read pages 499-529 in your prescribed book to consolidate what
you already know up to this stage about Reduced Instruction Set Computing.
Self-Assessment
1 What are some typical distinguishing characteristics of RISC organization?
2 Briefly explain the two basic approaches used to minimize register-memory operations on RISC machines.
3 What are the typical characteristics of a RISC instruction set architecture?
4 What is a delayed branch?
References
Stallings, W. (2013), Computer Organization And Architecture, 8e. Pearson Education, NJ.
Englander, I. (2010), The Architecture Of Computer Hardware, Systems Software & Networking, 4e. John Wiley & Sons, Asia.
Digital Electronics Basics: Logic Gates And Boolean Algebra (2013). Available from: http://www.ni.com/multisim/try/ [25 November 2013]
Dodge, N.B. (2012), Boolean Algebra And Combinational Data Logic. Available from: http://www.utdallas.edu/~dodge/EE2310/lec4.pdf [November 2013]
Nguyen, T.H.L. (2009), Computer Architecture. Available from: http://www.cnx.org/content/col10761/1.1/ [17 November 2013]
An Overview Of Cache. Available from: http://www.download.intel.com/design/intarch/papers/cache6.pdf [28 November 2013]