Computer Architecture AllClasses-Outline
Structure:
1.1 Introduction
Objectives
1.2 Computational Model
The basic items of computations
The problem description model
The execution model
1.3 Evolution of Computer Architecture
1.4 Process and Thread
Concept of process
Concept of thread
1.5 Concepts of Concurrent and Parallel Execution
1.6 Classification of Parallel Processing
Single instruction single data (SISD)
Single instruction multiple data (SIMD)
Multiple instruction single data (MISD)
Multiple instruction multiple data (MIMD)
1.7 Parallelism and Types of Parallelism
1.8 Levels of Parallelism
1.9 Summary
1.10 Glossary
1.11 Terminal Questions
1.12 Answers
1.1 Introduction
As you all know, computers vary greatly in terms of physical size, speed of
operation, storage capacity, application, cost, ease of maintenance and
various other parameters. The hardware of a computer consists of physical
parts that are connected in some way so that the overall structure achieves
the pre-assigned functions. Each hardware unit can be viewed at different
levels of abstraction. You will find that simplification can go on to still deeper
levels. You will be surprised to know that many technologies exist for
manufacturing microchips.
• De-multiplexers
• Coders
• Decoders
• I/O Controllers
A common foundation or paradigm that links the computer architecture and
language groups is called a Computational Model. The concept or idea of
computational model expresses a higher level of abstraction than can be
achieved by either the computer architecture or the programming language
alone, and includes both.
The computational model consists of the following three abstractions:
1. The basic items of computations
2. The problem description model
3. The execution model
Contrary to what one might assume, the set of abstractions that must be selected to specify computational models is not self-evident. A small set of criteria will define fewer but relatively basic computational models, while a wide variety of criteria will result in a fairly large number of different models.
1.2.1 The basic items of computations
This concept identifies the basic items of computation. It specifies the items to which the computation refers and the sort of computations (operations) that are executed on them. For example, in the von Neumann computational model, the fundamental items of computation are data. This data will normally be characterised by identifiable entities so as to be capable of distinguishing among several different data items in the course of a computation. These identifiable entities are commonly called variables in programming languages and are implemented by memory or register addresses in architectures.
The acknowledged computational models, such as the Turing model, the von Neumann model and the dataflow model, are based on the concept of data. These models are briefly explained below:
The Turing machine architecture operates by manipulating symbols on a tape. In other words, a tape with innumerable slots exists, and at any one point in time, the Turing machine is in a specific slot. Based on the symbol read at that slot, the machine can change the symbol and shift to a different slot. All of this behaviour is deterministic.
The von Neumann architecture describes the stored-program computer, where data and instructions are stored in memory and the machine works by varying its internal state. In other words, an instruction operates on some data and changes the data, so naturally there is a state maintained in the system.
Dataflow architecture differs markedly from the conventional von Neumann, or control flow, architecture: dataflow architectures have no program counter. The execution of instructions in dataflow systems is determined solely by the availability of input arguments to the instructions. Even though dataflow architecture has not been used in any commercially successful computer hardware, it is highly relevant in many software architectures, such as database engine designs and parallel computing frameworks.
On the other hand, there are various models independent of data. In these
models, the basic items of computation are:
• Messages or objects sent to them needing an associated manipulation (as
in the object-based model)
• Arguments and the functions applied on them (applicative model)
• Elements of sets and the predicates declared on them (predicate-logic-
based model).
1.2.2 The problem description model
The problem description model covers both the style and the method of problem description. The problem description style specifies the way problems in a specific computational model are expressed. The style is either procedural or declarative. In a procedural style, the algorithm for solving the problem is specified; a particular solution is then stated in the form of an algorithm. In a declarative style, all the facts and relationships relevant to the given problem have to be stated.
There are two modes for conveying these relationships and facts. The first employs functions, as in the applicative model of computation, while the second declares the relationships and facts in the form of predicates, as in the predicate-logic-based computational model. Now we will study the second component of the problem description model, that is, the problem description method. It is understood differently for the procedural and the declarative style. In the procedural style, the problem description model states the way in which the solution of the given problem has to be described. On the contrary, while using the declarative style, it states the way in which the problem itself has to be described.
1.2.3 The execution model
This is the third and the final constituent of computational model. It can be
divided into three stages.
• Interpretation of how to perform the computation
• Execution semantics
• Control of the execution sequences
The first stage describes the interpretation of the computation, which is strongly linked to the problem description method. The selection of the problem description method and the interpretation of the computation are mutually dependent on one another.
The subsequent stage of the execution model states the execution semantics. This is taken as a rule that identifies how a particular execution step is to be performed. This rule is, of course, linked with the selected problem description method and the way the execution of the computation is interpreted. The final stage of the model states the rule governing the execution sequences. In the basic models, execution is either control driven, data driven or demand driven.
• In control-driven execution, it is assumed that there is a program consisting of a sequence of instructions. The execution sequence is then implicitly specified by the order of the instructions. Nevertheless, explicit control instructions can also be used to specify a departure from the implied execution sequence.
• Data-driven execution is characterised by the rule that an operation is activated as soon as all the needed input data is available. Data-driven execution control is characteristic of the dataflow model of computation.
• In demand-driven execution, operations are activated only when their execution is required to attain the final result. Demand-driven execution control is normally used in the applicative computational model.
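To make the contrast concrete, the following C sketch (our own illustration, not drawn from any particular dataflow machine) evaluates (a + b) * (c + d) by firing each operation only when all of its inputs are marked ready, mimicking the data-driven rule above; a control-driven program would instead fix the order of the three operations in advance.

```c
#include <stdio.h>
#include <stdbool.h>

/* A node "fires" (executes) only when both of its inputs are ready,
 * mirroring the data-driven execution rule described above. */
typedef struct {
    int lhs, rhs;
    bool lhs_ready, rhs_ready;
} Node;

static bool try_fire(Node *n, int (*op)(int, int), int *out) {
    if (n->lhs_ready && n->rhs_ready) {
        *out = op(n->lhs, n->rhs);
        return true;    /* operation activated: all input data available */
    }
    return false;       /* still waiting for input data */
}

static int add(int x, int y) { return x + y; }
static int mul(int x, int y) { return x * y; }

int main(void) {
    /* Evaluate (a + b) * (c + d) in data-driven order. */
    Node n1 = {2, 3, true, true};     /* a + b: inputs already available */
    Node n2 = {4, 5, true, true};     /* c + d: inputs already available */
    Node n3 = {0, 0, false, false};   /* product node waits for n1, n2   */

    int s1, s2, p;
    if (try_fire(&n1, add, &s1)) { n3.lhs = s1; n3.lhs_ready = true; }
    if (try_fire(&n2, add, &s2)) { n3.rhs = s2; n3.rhs_ready = true; }
    if (try_fire(&n3, mul, &p))
        printf("(a+b)*(c+d) = %d\n", p);   /* prints 45 */
    return 0;
}
```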
Self Assessment Questions
1. The _________ model refers to both the style and method of problem
description.
The IAS machine was a new version of the EDVAC, built by von Neumann. The basic design of the IAS machine is now known as the von Neumann machine; it had five basic parts - the memory, the arithmetic logic unit, the program control unit, the input unit and the output unit - as shown in figure 1.2.
Activity 1:
Using the Internet, find out about the Fifth Generation Computer Systems (FGCS) project: the idea behind it, its implementation, timeline and outcome.
[Figure 1.8: Parallel Computing Systems]
In an SISD machine, only one instruction stream is acted on by the CPU during any one clock cycle.
This is the oldest and, until recently, the most widespread type of computer.
Examples: Most PCs, single CPU workstations and mainframes.
Figure 1.9 shows an example of SISD.
The instructions of a program can be reordered and combined into groups which are then acted upon in parallel without altering the outcome of the program. This is known as instruction-level parallelism.
Advances in instruction-level parallelism dominated computer architecture
from the mid-1980s until the mid-1990s.
Data parallelism: Data parallelism is parallelism inherent in program loops. It centres on distributing the data across various computing nodes to be processed in parallel. Parallelising loops frequently leads to similar (not necessarily identical) operation sequences or functions being performed on elements of a large data structure. Many scientific and engineering applications display data parallelism.
Self Assessment Questions
15. Parallel computers offer the potential to concentrate computational
resources on important computational problems. (True/ False)
16. Advances in instruction-level parallelism dominated computer architecture
from the mid-1990s until the mid-2000s. (True/False)
Parallelism is available at different sizes of granularity. In this respect, we can identify the following four levels and corresponding granularity sizes:
• Parallelism at the instruction level (fine-grained parallelism): Available
instruction-level parallelism means that particular instructions of a program
may be executed in parallel. To this end, instructions can be either
assembly (machine-level) or high-level language instructions. Usually,
instruction-level parallelism is understood at the machine-language (assembly-language) level.
• Parallelism at the loop level (middle-grained parallelism): Parallelism
may also be available at the loop level. Here, consecutive loop iterations
are candidates for parallel execution. However, data dependencies
between subsequent loop iterations, called recurrences, may restrict their
parallel execution.
• Parallelism at the procedure level (middle-grained parallelism): Next,
there is parallelism available at the procedure level in the form of parallel
executable procedures. The extent of parallelism exposed at this level depends mainly on the kind of problem solution considered.
• Parallelism at the program level (coarse-grained parallelism): Lastly,
different programs (users) are obviously independent of each other. Thus,
parallelism is also available at the user level (which we consider to be
coarse-grained parallelism). Multiple, independent users are a key source
of parallelism occurring in computing scenarios.
Utilisation of functional parallelism: Available parallelism can be utilised by
architectures, compilers and operating systems conjointly for speeding up
computation. Let us first discuss the utilisation of functional parallelism.
In general, functional parallelism can be utilised at four different levels of
granularity, that is,
• Instruction
• Thread
• Process
• User level
It is quite natural to utilise available functional parallelism, which is inherent in
a conventional sequential program, at the instruction level by executing
instructions in parallel. This can be achieved by means of architectures
capable of parallel instruction execution. Such architectures are referred to as
instruction-level function-parallel architectures or simply instruction-level
parallel architectures, commonly abbreviated as ILP-architectures.
Activity 2:
Decide which architecture is most appropriate for a given application. First determine the form of parallelisation that would best suit the application, then decide on both the hardware and software for running your parallelised application.
1.9 Summary
Let us recapitulate the important concepts discussed in this unit:
• Computer Architecture deals with the issue of selection of hardware
components and interconnecting them to create computers that achieve
specified functional, performance and cost goals.
• The concept of a computational model represents a higher level of
abstraction than can be achieved by either the computer architecture or
the programming language alone, and covers both.
• History of computers begins with the invention of the abacus in 3000 BC,
followed by the invention of mechanical calculators in 1617. Fifth
generation computers are still under research and development.
• Each process provides the resources needed to execute a program.
• A thread is the entity within a process that can be scheduled for execution.
• Concurrent execution is the temporal behaviour of the N-client 1-server
model where one client is served at any given moment.
1.10 Glossary
• EDSAC: Electronic Delay Storage Automatic Calculator
• EDVAC: Electronic Discrete Variable Automatic Computer
• ENIAC: Electronic Numerical Integrator and Calculator
• IC: Integrated Circuit where hundreds of transistors could be put on a
single small circuit.
• LSI: Large Scale Integration, it can pack thousands of transistors
• MSI: Medium Scale Integration, it packs as many as 100 transistors
• PCB: Process Control Block, it is a description table which contains all the information relevant to the whole life cycle of a process.
• SSI: Small Scale Integration, it can pack 10 to 20 transistors in a single chip.
• UNIVAC I: Universal Automatic Computer
• ULSI: Ultra Large Scale Integration, it contains millions of components on a single IC
• VLSI: Very Large Scale Integration, it can pack hundreds of thousands of transistors
1.11 Terminal Questions
1. Explain the concept of Computational Model. Describe its various types.
2. What are the different stages of evolution of Computer Architecture?
Explain in detail.
3. What is the difference between process and thread?
4. Explain the concepts of concurrent and parallel execution.
5. State Flynn’s classification of Parallel Processing.
6. Explain the types of parallelism.
7. What are the various levels of parallelism?
1.12 Answers
Self Assessment Questions
1. Problem description
2. Procedural style
3. Data-driven
4. Pascaline
5. IAS machine
6. False
7. True
8. True
9. False
10. N-client 1-server
11. True
12. Single Instruction Multiple Data
13. Multiple Instruction Single Data
14. Multiple Instruction Multiple Data
15. True
16. False
17. Utilised parallelism
18. False
19. True
Terminal Questions
1. A common foundation or paradigm that links the computer architecture and language groups is called a Computational Model. Refer Section 1.2.
2. History of computers begins with the invention of the abacus in 3000 BC,
followed by the invention of mechanical calculators in 1617. The years
beyond 1642 till 1980 are marked by inventions of zeroth, first, second and
third generation computers. Refer Section 1.3.
3. A thread is the entity within a process that can be scheduled for execution.
Refer Section 1.4.
4. Concurrent execution is the temporal behaviour of the N-client 1-server
model where one client is served at any given moment. Parallel execution
is associated with N-client N-server model. Refer Section 1.5.
5. Flynn classifies the computer system into four categories. Refer Section
1.6.
6. There are three types of parallelism. Refer section 1.7.
7. The notion of parallelism is used in two different contexts. Either it designates available parallelism in programs or it refers to parallelism utilised during execution. Refer Section 1.8.
References:
• Hwang, K. (1993). Advanced Computer Architecture. McGraw-Hill.
• Godse, D. A. & Godse, A. P. (2010). Computer Organization. Technical Publications. pp. 3-9.
• Hennessy, John L., Patterson, David A. & Goldberg, David (2002). Computer Architecture: A Quantitative Approach (3rd edition). Morgan Kaufmann.
• Sima, Dezso, Fountain, Terry J. & Kacsuk, Peter (1997). Advanced Computer Architectures: A Design Space Approach. Addison-Wesley-Longman.
E-references:
• www.cs.clemson.edu/~mark/hist.html
• www.people.bu.edu/bkia/
• www.ac.upc.edu/
• www.inf.ed.ac.uk/teaching/courses/car/
Structure:
2.1 Introduction
Objectives
2.2 Changing Face of Computing
Desktop computing
Servers
Embedded computers
2.3 Computer Designer
2.4 Technology Trends
2.5 Quantitative Principles in Computer Design
Advantages of parallelism
Principle of locality
Focus on the common case
2.6 Power Consumption
2.7 Summary
2.8 Glossary
2.9 Terminal Questions
2.10 Answers
2.1 Introduction
In the previous unit, you studied about the computational model and the
evolution of computer architecture. Also, you studied the concept of process
thread. We also covered two types of execution - concurrent and parallel and
also the types and level of parallelism. In this unit, we will throw light on the
changing face of computing, the task of computer designer and its quantitative
principles. We will also examine the technology trends and understand the
concept of power consumption and efficiency of the matrix.
You can define computer design as an activity that translates the architectural design of a computer into an implementation of a particular organisation. Thus, computer design is also referred to as computer implementation. The computer designer is responsible for the hardware architecture of the computer.
Objectives:
After studying this unit, you should be able to:
• identify the changing face of computing
• explain the tasks of the computer designer
• describe the technology trends
• discuss the quantitative principles of the computer design
• describe power consumption and efficiency metrics
These changes have dramatically altered the face of computing and computing applications. This has led to three different computer markets, each adapted to different requirements, specifications and applications. These are explained as follows:
2.2.1 Desktop computing
Desktop computers have the largest market in dollar terms. The market ranges from low-end systems to very high-end, heavily configured computer systems. Across this range, both cost and capability vary in terms of performance. This blend of performance and price matters most to the customers in this market and thus to the computer designers. Consequently, both the newest, highest-performance microprocessors and cost-reduced microprocessors sell largely in the category of desktop systems.
Characteristics of desktop computing
The important characteristics of desktop computing are:
1. Ease-of-use: In desktop computers, all the computer parts come as separate detachable components, making them easy and comfortable for the user to use.
2. Extensive graphic capabilities: It provides extensive graphics capabilities.
For embedded computers, performance is crucial, but the chief objective is to meet the performance need at minimum cost.
Characteristics of embedded computers
1. Real-time performance: The performance requisite in an embedded
application is real-time execution. Speed, though in varying degrees, is an
important factor in all architectures. The ability to assure real-time
performance acts as a constraint on the speed needs of the system. Real-
time performance means that the agent is assured to perform within
certain time restraints as specified by the task and the environment.
2. Soft real-time: In a number of applications, a more nuanced requirement exists: the average time for a particular task is constrained, as well as the number of instances when some maximum time is exceeded. Such approaches are called soft real-time, and they arise when it is possible to occasionally miss the time constraint on an event, provided that not too many are missed.
3. Need to minimise memory size: Memory can be a considerable element
of the system cost. Thus, it is vital to limit the memory size according to
the requirement.
4. Need to minimise memory power: Larger memory also means a higher power need. The emphasis on low power comes from the use of batteries. Unnecessary use of power must be avoided to keep the power need low.
Self Assessment Questions
5. The __________ had the ability to integrate the functions of a
computer’s Central Processing Unit (CPU) on a single-integrated circuit.
6. _____________ computers used to support typical applications like
business data support and large-scale scientific computing.
7. The performance requirement in an embedded application is real-time
execution. (True/False)
8. ______________ is the chief objective of embedded computers.
computer architects. The world's first computer designer was Charles Babbage (1791-1871) (see Figure 2.2). He is considered the father of computers and holds the credit for inventing the first mechanical computer, which eventually led to more complex designs.
Now, we will discuss the low-level implementation of the 80x86 instruction set. Computers cannot execute high-level language constructs like the ones found in C. Rather, they execute a relatively small set of machine instructions, such as addition, subtraction, Boolean operations, and data transfers. Thus, the engineers decided to encode the instructions in a numeric format (suitable for storage in memory). The structure of the ISA is given below:
1. Class of ISA: The operands are registers or memory locations, and nearly all ISAs today are classified as general-purpose register architectures. The 80x86 has 16 general-purpose registers and 16 registers that can hold floating-point data. The two popular versions of this class are register-memory ISAs, such as the 80x86, which can access memory as part of many instructions, and load-store ISAs, such as MIPS, which can access memory only with load or store instructions.
Figure 2.4 shows the structure of a programming model consisting of General
Purpose Registers and Memory.
[Instruction format: load | reg | address]
Even this needs a large space in an instruction for a large address. The address is the beginning of an array, and the particular array element needed can be selected by the index.
iii) Base plus index plus offset
The beginning address of the array could be stored in the base register, the
index will choose the particular record needed and the offset can choose the
field inside that record.
iv) Scaled
The beginning of an array or vector is stored in the base register, and the index register contains the number of the particular array element needed; the index is scaled by the size of an array element.
v) Register Indirect
This is a distinctive addressing mode in which a register holds the address of the operand. Many computers implement it simply as base plus offset with an offset value of 0.
4. Types and sizes of operands: Machine instructions are operated on
operands of several types. Some types supported by ISAs include
character (e.g., 8-bit ASCII or 16-bit Unicode), signed and unsigned
integers, and single- and double-precision floating-point numbers. ISAs
typically support various sizes for integer numbers.
For example, arithmetic instructions which operate on 8-bit integers, 16-bit integers (short integers), and 32-bit integers are included in a 32-bit architecture. Signed integers are represented using two's complement binary representation.
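A short C example (a minimal sketch of ours) illustrates the two's complement representation at the 8-bit size mentioned above:

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* In two's complement, negation is "invert all bits, then add 1". */
    int8_t five = 5;                    /* 8-bit signed integer: 00000101 */
    int8_t neg  = (int8_t)(~five + 1);  /* 11111011, which represents -5  */

    /* The same bit pattern reads differently as signed and unsigned,
     * which is why ISAs provide distinct signed/unsigned instructions. */
    printf("bit pattern as unsigned: %u\n", (uint8_t)neg);  /* 251 */
    printf("bit pattern as signed:   %d\n", neg);           /* -5  */
    return 0;
}
```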
Here, in this unit, the word architecture covers all three aspects of computer
design - instruction set architecture, organisation, and hardware. Thus,
computer designers must design a computer keeping in mind the functional
requirements as well as price, power and performance goals. The functional
requirements also have to be determined by the computer architect, which is
a tedious job. The requirements are determined after reviewing the market
specific features. Also, the computer designers must be aware of the
technology trends in the market and the use of computers to avoid
unnecessary costs and failure of the architecture system. Thus, we will study
some important technology trends in the following section.
Self Assessment Questions
5. The world’s first designer was __________________
6. _________________ acts as the boundary between software and
hardware.
7. ISA has __________________ general-purpose registers.
8. CISC stands for __________________ .
Activity 1:
Visit any two organisations. Now make a list of the different types of computers they are using - desktops, servers and embedded computers - and compare them with one another. What proportion of each type of computing are they using?
the dynamic and rapidly changing market. The instruction set should be designed so as to adapt to the rapid changes in technology. The designer should plan for the technology changes that can determine the success of the computer.
There are four main technology changes that are essential to modern implementations. These are as follows:
1. Integrated circuit logic technology: Integrated circuits or microchips are
electronic circuits manufactured by forming interconnections between
semiconductor devices. Changes in this technology occur rapidly. Some examples are the evolution of mobile phones, digital microwave ovens, etc.
2. Semiconductor DRAM (dynamic random-access memory): DRAM
uses a capacitor to store each bit of data, and the level of charge on each
capacitor determines whether that bit is a logical 1 or 0. However, these capacitors do not hold their charge indefinitely, so the data needs to be refreshed periodically. It is the semiconductor memory equipped in personal computers and workstations. Its density increases by about 40% every year.
3. Magnetic disk technology: Magnetic disks include floppy disks, compact disks, hard disks, etc. The disk surface is coated with magnetic particles grouped into microscopic areas called domains. Each domain acts like a tiny magnet with north and south poles. This technology's density is currently increasing by about 30% every year.
4. Network technology: Networks refer to a range of computers and hardware components connected together through communication channels. Communication protocols govern the communication in the network and provide the basis for network programming. Performance depends both on the switches and on the transmission systems.
These rapidly changing technologies shape the design of a computer that, with luck, will have a life of more than five years. By studying these technology trends, computer designers have been able to reduce costs at the rate at which the technology changes.
Self Assessment Questions
9. The designer should never plan for the technology changes that would determine the success of the computer. (True/False)
Parallelism can also be exploited at the level of detailed digital design. For example, set-associative caches use multiple banks of memory that are typically searched in parallel to find a desired item. Modern ALUs use carry-lookahead, which uses parallelism to speed up the process of computing sums from linear to logarithmic in the number of bits per operand.
2.5.2 Principle of locality
Principle of locality is an important program property as programs tend to
reuse the data and instructions they have already used. Principle of locality
follows the rule that it can help us foresee the data and instructions that a
program might require in the near future. This forecast is made depending on
the trend of usage of the data and instructions in its history.
There are two different kinds of locality: temporal locality, which states that items referenced recently are likely to be accessed again in the near future, and spatial locality, which states that items whose addresses are near one another tend to be referenced close together in time. Recently used items are kept in a component called cache memory, which is located between the CPU (or processor) and the main memory, as shown in figure 2.7.
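A small C sketch of ours illustrates both kinds of locality with a simple array traversal:

```c
#include <stdio.h>

#define N 1024
static double a[N][N];

/* Spatial locality: C stores a[i][j] row by row, so traversing the
 * rightmost index innermost touches consecutive addresses and uses
 * each fetched cache line fully before it is evicted. */
double sum_row_major(void) {
    double s = 0.0;   /* s is reused every iteration: temporal locality */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];       /* sequential addresses: cache friendly */
    return s;
}

/* The same arithmetic with the loops interchanged strides through
 * memory N elements at a time, wasting most of each cache line. */
double sum_col_major(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];       /* large strides: poor spatial locality */
    return s;
}

int main(void) {
    printf("%f %f\n", sum_row_major(), sum_col_major());
    return 0;
}
```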
For example, the instruction fetch and decode unit of a processor should be optimised first, as it may be used more often than a multiplier. This principle applies to dependability as well.
Optimising the frequent case is more beneficial and often simpler than optimising the infrequent case. For example, overflow is rare when adding two numbers in the processor, so performance is improved by optimising the more common case of no overflow. To apply this principle, we need to decide what the common case is and how much performance can be gained by making it faster. To quantify this, we will study Amdahl's Law below.
Amdahl’s Law
This law helps compute the performance gain that can be obtained by
improving any division of the computer. Amdahl’s law states that “the
performance improvement to be gained from using some faster mode of
execution is limited by the fraction of the time the faster mode can be used.”
(Hennessy and Patterson)
Figure 2.8 shows the predicted speedup using Amdahl's law in graphic form. Speedup is given by:

Speedup = Execution time for entire task without using the enhancement / Execution time for entire task using the enhancement when possible
Amdahl’s law helps us to find the speedup from some enhancement. This
depends on the following two factors:
1. The fraction of the computation time in the original computer that can be
converted to take advantage of the enhancement - For example, if 20
seconds of the execution time of a program that takes 60 seconds in total
can use an enhancement, the fraction is 20/60. This value, which we will
call Fraction enhanced, is always less than or equal to 1.
2. The improvement gained by the enhanced execution mode; that is, how
much faster the task would run if the enhanced mode were used for the
entire program - This value is the time of the original mode over the time
of the enhanced mode. If the enhanced mode takes, say, 2 seconds for a portion of the program that takes 5 seconds in the original mode, the improvement is 5/2. We will call this value, which is always greater than 1, Speedup enhanced.
The execution time using the original computer with the enhanced mode will
be the time spent using the unenhanced portion of the computer plus the time
spent using the enhancement:
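The expression that belongs here is the standard statement of Amdahl's Law as given by Hennessy and Patterson:

Execution time_new = Execution time_old x ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

so that the overall speedup is

Speedup_overall = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

A small C sketch applies this to the worked numbers above; the function name is ours, chosen for illustration:

```c
#include <stdio.h>

/* Overall speedup per Amdahl's Law: 1 / ((1 - f) + f / s), where f is
 * the fraction of execution time that can use the enhancement and s is
 * the speedup of the enhanced fraction. */
double amdahl(double fraction_enhanced, double speedup_enhanced) {
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced);
}

int main(void) {
    /* Numbers from the text: 20 of 60 seconds can use the enhancement
     * (f = 20/60), and the enhanced mode is 5/2 times faster. */
    double f = 20.0 / 60.0, s = 5.0 / 2.0;
    printf("overall speedup = %.3f\n", amdahl(f, s));   /* 1.250 */
    return 0;
}
```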
2.7 Summary
Let us recapitulate the important concepts discussed in this unit:
• There are two types of execution - concurrent and parallel.
• Computer design is an activity that translates the architectural design of the computer into an implementation of a particular organisation.
• Computer technology has made dramatic progress in the roughly 60 years since the first general-purpose computer was invented.
• Desktop computers have the largest market in terms of costs. It varies
from low-end systems to very high-end heavily configured computer
systems.
• The world's first computer designer was Charles Babbage, who is considered the father of computers.
• Computer designer needs to determine the attributes that are necessary
for a new computer, then design a computer to maximise the performance.
• The Instruction Set Architecture (ISA) is the part of the processor that is
visible to the programmer or compiler writer.
• Performance of the computer is improved by taking advantage of
parallelism.
• Focussing on the common case will work positively both for power and
resource allocation, thus, leading to advancement.
2.8 Glossary
• CISC: Complex instruction set computer
• Computer designer: A person who designs CPUs or computers that are actually built, come into considerable use and influence the further development of computer designs.
• Desktop computers: These are in the form of personal computers (also known as PCs).
2.10 Answers
Self Assessment Questions
1. Microprocessor
2. Main-frame
3. True
4. Minimum cost
5. Charles Babbage
6. ISA
7. 16
8. Complex instruction set computer
9. False
10. Integrated circuits or microchips
11. Adopting parallelism
12. Scalability
13. Pipelining
14. Temporal Locality
15. Spatial Locality
16. Power efficiency
17. Dynamic Power
18. High issue rates, sustained performance
Terminal Questions
1. Desktop computers have the largest market in terms of costs. It varies from
low-end systems to very high-end heavily configured computer systems.
Refer Section 2.2.
2. An embedded system is a single-purpose computer embedded in a device to control some particular function of that bigger device. The performance requirement of an embedded application is real-time execution. Refer Section 2.2.
3. Computer Designer is a person who has designed CPUs or computers that
were actually built and came into considerable use and influenced the
further development of computer designs. Refer Section 2.3.
4. Architecture covers all three aspects of computer design - instruction set
architecture, organisation, and hardware. Refer Section 2.3.
5. Technology trends need to be studied on a regular basis in order to cope
with the dynamic and rapidly changing market. The instruction set should be designed so as to adapt to the rapid changes in technology. Refer Section 2.4.
6. Quantitative principles in computer design are: Take Advantage of
Parallelism, Principle of Locality, Focus on the Common Case and
Amdahl’s Law. Refer Section 2.5.
7. Amdahl’s law states that the performance improvement to be gained from
using some faster mode of execution is limited by the fraction of the time
the faster mode can be used. Refer Section 2.5.
References:
• Salomon, David (2008). Computer Organisation. NCC Blackwell.
• Hennessy, John L. & Patterson, David A. Computer Architecture: A Quantitative Approach (4th edition). Morgan Kaufmann.
3.1 Introduction
In the previous unit, you have studied about fundamentals of computer
architecture and design. Now we will study in detail about the instruction set
and its principles.
The instruction set or the instruction set architecture (ISA) is the set of basic
instructions that a processor understands. In other words, an instruction set,
or instruction set architecture (ISA), is the part of the computer architecture
related to programming, including the native data types, instructions, registers,
addressing modes, memory architecture, interrupt and exception handling,
and external I/O. There are a number of instructions in a program that have to be accessed in a particular sequence. This motivates us to describe instruction issue and sequencing, which we will study in this unit. You will study the fundamentals involved in instruction set architecture and design. First, the operations in the instruction sets, instruction set architecture, memory locations and addresses, memory addressing, the abstract model of the main memory, and instructions for control flow need to be discussed.
push addr - Places the value at memory address addr on top of the stack: push(M[addr])
pop addr - Stores the top value on the stack at memory address addr: M[addr] = pop()
add - Adds the top two values on the stack and pushes the result onto the stack: push(pop() + pop())
sub - Subtracts the second top value from the top value of the stack and pushes the result onto the stack: push(pop() - pop())
mult - Multiplies the top two values on the stack and pushes the result onto the stack: push(pop() * pop())
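The table above can be made concrete with a minimal stack-machine interpreter in C; this sketch is ours, assuming a small fixed-size stack, and evaluates (8 - 3) * 4 with zero-address arithmetic instructions:

```c
#include <stdio.h>

#define STACK_SIZE 64
static int stack[STACK_SIZE];
static int sp = 0;                      /* index of next free slot */

static void push(int v) { stack[sp++] = v; }
static int  pop(void)   { return stack[--sp]; }

/* add, sub and mult pop the top two values and push the result,
 * exactly as in the table above. */
static void op_add(void)  { push(pop() + pop()); }
static void op_sub(void)  { int top = pop(); push(top - pop()); }
static void op_mult(void) { push(pop() * pop()); }

int main(void) {
    push(3); push(8);       /* stack holds 3, 8 (8 on top)  */
    op_sub();               /* push(8 - 3): stack holds 5   */
    push(4);
    op_mult();              /* push(5 * 4): stack holds 20  */
    printf("%d\n", pop());  /* prints 20 */
    return 0;
}
```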
The branch and jump instructions are identical in their use but sometimes they
are used to denote different addressing modes. The branch is usually a one-
address instruction. Branch and jump instructions may be conditional or
unconditional.
An unconditional branch instruction, as the name denotes, causes a branch to the specified address without any conditions. On the contrary, the conditional
branch instruction specifies a condition such as branch if positive or branch if
zero. If the condition is met, the program counter is loaded with the branch
address and the next instruction is taken from this address. If the condition is
not met, the program counter remains unaltered and the next instruction is
taken from the next location in sequence.
The skip instruction does not require an address field and is, therefore, a zero-
address instruction. A conditional skip instruction will skip the next instruction,
if the condition is met. This is achieved by incrementing the program counter
during the execute phase in addition to its being incremented during the fetch
phase. If the condition is not met, control proceeds with the next instruction in
sequence where the programmer inserts an unconditional branch instruction.
Thus, a skip-branch pair of instructions causes a branch if the condition is not
met, while a single conditional branch instruction causes a branch if the
condition is met.
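The program-counter behaviour described above can be sketched in a few lines of C; the encoding and function below are hypothetical, purely for illustration:

```c
#include <stdio.h>
#include <stdbool.h>

/* Program-counter update rules for branch and skip instructions.
 * pc indexes instructions in sequence. */
unsigned next_pc(unsigned pc, bool is_branch, bool is_skip,
                 bool condition_met, unsigned branch_addr) {
    pc = pc + 1;                  /* incremented during the fetch phase */
    if (is_branch && condition_met)
        return branch_addr;       /* conditional branch taken           */
    if (is_skip && condition_met)
        return pc + 1;            /* incremented again during execute   */
    return pc;                    /* next instruction in sequence       */
}

int main(void) {
    printf("%u\n", next_pc(10, true,  false, true,  42)); /* 42: taken   */
    printf("%u\n", next_pc(10, true,  false, false, 42)); /* 11: untaken */
    printf("%u\n", next_pc(10, false, true,  true,  0));  /* 12: skipped */
    return 0;
}
```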
The call and return instructions are used in conjunction with subroutines. The
compare instruction performs a subtraction between two operands, but the
result of the operation is not retained. However, certain status bit conditions
are set as a result of the operation. In a similar fashion, the test instruction
performs the logical AND of two operands and updates certain status bits
without retaining the result or changing the operands. The status bits of
interest are the carry bit, the sign bit, a zero indication, and an overflow
condition.
The four status bits are symbolised by C, S, Z, and V. The bits are set or
cleared as a result of an operation performed in the ALU.
1. Bit C (carry) is set to 1 if the end carry C8 is 1. It is cleared to 0 if the carry
is 0.
2. Bit S (sign) is set to 1 if the highest-order bit F7 is 1. It is cleared to 0 if the bit is 0. S = 0 indicates a positive number and S = 1 indicates a negative number.
3. Bit Z (zero) is set to 1 if the result of the ALU contains all 0’s. It is cleared
to 0 otherwise. In other words, Z = 1 if the result is zero and Z = 0 if the
result is not zero.
4. Bit V (overflow) is set to 1 if the exclusive-OR of the last two carries is equal
to 1, and cleared to 0 otherwise. This is the condition for an overflow when
negative numbers are in 2’s complement. For the 8-bit ALU, V = 1 if the
result is greater than +127 or less than -128.
As you can see in figure 3.5, the status bits can be checked after an ALU
operation to determine certain relationships that exist between the values of A
and B. If bit V is set after the addition of two signed numbers, it indicates an
overflow condition.
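As a concrete illustration of these rules, the following C sketch (ours, assuming the 8-bit ALU discussed in the text) computes the four status bits after an addition:

```c
#include <stdio.h>
#include <stdint.h>

/* Compute the C, S, Z and V status bits for an 8-bit addition,
 * following the rules listed above. */
void add8_flags(uint8_t a, uint8_t b) {
    uint16_t wide = (uint16_t)a + (uint16_t)b;  /* keeps the end carry C8 */
    uint8_t  f    = (uint8_t)wide;

    int C = (wide >> 8) & 1;        /* end carry out of bit position 7 */
    int S = (f >> 7) & 1;           /* highest-order bit F7            */
    int Z = (f == 0);               /* result is all zeros             */
    /* Overflow: both operands share a sign that the result does not. */
    int V = (((a ^ f) & (b ^ f)) >> 7) & 1;

    printf("%4d + %4d = %4d  C=%d S=%d Z=%d V=%d\n",
           (int8_t)a, (int8_t)b, (int8_t)f, C, S, Z, V);
}

int main(void) {
    add8_flags(100, 50);    /* 150 > +127, so V = 1: signed overflow */
    add8_flags(10, 246);    /* 10 + (-10) = 0, so Z = 1 and C = 1    */
    return 0;
}
```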
Activity 2:
Visit a computer hardware store and try to collect as much information as
possible about the MIPS processor. Compare its features with other
processors.
3.8 Summary
• Each computer has its own particular instruction code format called its
Instruction Set.
• The different types of instruction formats are three-address instructions,
two-address instructions, one-address instructions and zero-address
instructions.
• A distinct addressing mode field is required in instruction format for signal
processing.
• The program is executed by going through a cycle for each instruction.
• The prototype chip of MIPS architecture demonstrated that it is possible to
integrate a microprocessor with five-stage execution pipeline and cache
controller into a single silicon chip.
3.9 Glossary
• Cell: The smallest unit of memory that the CPU can read or write is cell.
• Decoding: It means interpretation of the instruction.
• Fields: Groups containing bits of instruction.
• Instruction set: Each computer has its own particular instruction code
format called its Instruction Set.
• MIPS: Microprocessor without Interlocked Pipeline Stages.
• Operation: It is a binary code that instructs the computer to perform a
specific operation.
• RISC: Reduced Instruction Set Computer
• Words: Hardware-accessible units of memory larger than one cell are
called words.
3.11 Answers
Self Assessment Questions
1. Fields
2. One-address instructions
3. False
4. True
5. Zero-address
Terminal Questions
1. Each computer has its own particular instruction code format called its
Instruction Set. Refer Section 3.2.
2. The different types of instruction formats are three-address instructions,
two-address instructions, one-address instructions and zero-address
instructions. Refer Section 3.2.
3. Memory addressing is the logical structure of a computer's random-access memory (RAM). Refer Section 3.3.
4. A distinct addressing mode field is required in instruction format for signal
processing. Refer Section 3.4.
5. The program is executed by going through a cycle for each instruction.
Each instruction cycle is now subdivided into a sequence of sub cycles or
phases. Refer Section 3.5.
6. The conditions for altering the content of the program counter are specified
by program control instruction, and the conditions for data- processing
operations are specified by data transfer and manipulation instructions.
Refer Section 3.6.
7. After considerable research on efficient processor organisation and VLSI
integration at Stanford University, the MIPS architecture evolved. Refer
Section 3.7.
References:
• Hwang, K. (1993). Advanced Computer Architecture. McGraw-Hill.
• Godse, D. A. & Godse, A. P. (2010). Computer Organization. Technical Publications. pp. 3-9.
• Hennessy, John L., Patterson, David A. & Goldberg, David (2002). Computer Architecture: A Quantitative Approach (3rd edition). Morgan Kaufmann.
4.1 Introduction
In the previous unit, you studied about the changing face of computing. Also,
you studied the meaning and tasks of a computer designer. We also covered
the technology trends and the quantitative principles in computer design. In
this unit, we will introduce you to pipelining processing, the pipeline hazards,
structural hazards, control hazards and techniques to handle them. We will
also examine the performance improvement with pipelines and understand the
effect of hazards on performance.
A parallel processing system can carry out concurrent data processing to attain quicker execution time. For example, while one instruction is being executed in the ALU, the next instruction can be read from memory.
4.2 Pipelining
An implementation technique by which the execution of multiple instructions
can be overlapped is called pipelining. This pipeline technique splits up the
sequential process of an instruction cycle into sub-processes that operates
concurrently in separate segments. As you know computer processors can
execute millions of instructions per second. At the time one instruction is
getting processed, the following one in line also gets processed within the
same time, and so on. A pipeline permits multiple instructions to get executed
at the same time. Without a pipeline, every instruction has to wait for the
previous one to be completed. The main advantage of pipelining is that it increases the instruction throughput, which is defined as the number of instructions completed per unit of time.
In the figure, each segment has one or more registers with combinational circuits. Each register is loaded with new data at the start of a new time segment. Refer to table 4.2 for an example of the contents of the registers in the pipeline.
On the 1st clock pulse, the operands are loaded into registers R1, R2, R3, and R4.
On the 2nd clock pulse, the products are stored in registers R5 and R6.
On the 3rd clock pulse, the data in R5 and R6 are added and stored in R7.
Thus, only three clock periods are required to compute each An*Bn + Cn*Dn.
Table 4.2: Contents of Registers in Pipeline Example

Clock Pulse | Segment 1: R1 R2 R3 R4 | Segment 2: R5, R6 | Segment 3: R7
1 | A1 B1 C1 D1 | - | -
2 | A2 B2 C2 D2 | A1*B1, C1*D1 | -
3 | A3 B3 C3 D3 | A2*B2, C2*D2 | A1*B1 + C1*D1
4 | - | A3*B3, C3*D3 | A2*B2 + C2*D2
5 | - | - | A3*B3 + C3*D3
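A short C sketch (illustrative only) simulates the three segments of this pipeline; within each clock pulse, the later segments read the registers before the earlier ones overwrite them, mimicking simultaneous latching:

```c
#include <stdio.h>

#define N 3   /* number of input sets (Ai, Bi, Ci, Di) */

int main(void) {
    int A[N] = {1, 2, 3}, B[N] = {4, 5, 6},
        C[N] = {7, 8, 9}, D[N] = {1, 1, 1};
    int R1 = 0, R2 = 0, R3 = 0, R4 = 0;   /* segment 1 latches        */
    int R5 = 0, R6 = 0;                   /* segment 2: the products  */
    int R7 = 0;                           /* segment 3: the final sum */

    /* After the pipeline fills (3 pulses), one result emerges per pulse. */
    for (int t = 0; t < N + 2; t++) {
        if (t >= 2) {                         /* segment 3: add      */
            R7 = R5 + R6;
            printf("pulse %d: result %d\n", t + 1, R7);
        }
        if (t >= 1 && t - 1 < N) {            /* segment 2: multiply */
            R5 = R1 * R2;
            R6 = R3 * R4;
        }
        if (t < N) {                          /* segment 1: load     */
            R1 = A[t]; R2 = B[t]; R3 = C[t]; R4 = D[t];
        }
    }
    return 0;
}
```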
In this three-stage pipeline, the input data must go through stages 1, 2 and 3
to perform multiplication and through stages 1 and 3 only to perform
subtraction. Therefore, dynamic pipelines require feed forward and feedback
connections in addition to the streamline connections between the stages.
[Figure 4.6: Three-Cycle Stall in the Pipeline]
The control hazard stall is not implemented in the same way as the data hazard stall, since the instruction fetch (IF) cycle must be repeated as soon as the branch target is known. Thus, the first IF cycle is essentially a stall, as it never performs useful work. The stall can be implemented for the three cycles by setting the IF/ID register to zero. The repetition of the IF stage is not required if the branch is untaken, since the correct instruction will already have been fetched.
Self Assessment Questions
14. _________ cause a greater performance loss for a pipeline than _________.
15. If the PC is changed by the branch to its target address, then it is known
as __________________ branch; else it is known as __________ .
[Pipeline diagram: instructions i + 2, i + 3 and i + 4 proceed through the IF ID EX MEM WB stages]
In practice, most machines with delayed branch have a single instruction delay, and we focus on that case.
Self Assessment Questions
16. The problem posed due to data hazards can be solved with a simple
hardware technique called __________________ .
17. Forwarding is also called _________ or _________________ .
18. ____________ is the method of holding or deleting any instructions
after the branch until the branch destination is known.
19. ________________ technique simply allows the hardware to
continue as if the branch were not executed.
As such, the cycle count should be equal to the sum of these three registers.
• In the dual-issue processor, only one of the instruction count, load stall, or
branch stall counters is increased, but the instruction count register may
sometimes be incremented by two (for cycles in which two instructions
execute). As such, the sum of these three registers will be greater than or
equal to the cycle count.
The processor should update the performance counters during the write-back stage of the pipeline, so that each cycle is counted exactly once - as an instruction, a branch stall, or a load stall. The current value of these counters can be determined by using an LD or LDR instruction to access them. The LD instruction takes a source label and stores its address into the destination register. The LDR instruction stores the source register's value plus an immediate offset into the destination register.
To avoid complexity, the values of the counter registers are not changed by stores to these locations, though the contents of memory may still be updated by the stores. This makes hardly any difference because, whenever these locations are read, the value in the counter is used rather than the value in memory. Basically, these counters can be reset to zero only when the entire system is reset.
Self Assessment Questions
20. ____________ states the number of cycles lost to load-use stalls.
21. ____________ instruction takes a source label and stores its address
into the destination register.
22. ____________ stores the source register's value plus an immediate
value offset and stores it in the destination register.
CPI stands for cycles per instruction, the number of clock cycles each instruction takes. The ideal CPI on a pipelined machine is almost always 1. Therefore, the pipelined CPI is:
CPIpipelined = Ideal CPI + Pipeline stall clock cycles per instruction
If the cycle time overhead of pipelining is ignored and the stages are all
assumed to be perfectly balanced, then the two machines have an equal cycle
time and:
Speedup = CPI unpipelined / (1 + Pipeline stall cycles per instruction)
If all instructions take the same number of cycles, which must also equal the
number of pipeline stages (the depth of the pipeline) then unpipelined CPI is
equal to the depth of the pipeline, leading to
Speedup = Pipeline depth / (1 + Pipeline stall cycles per instruction)
If there are no pipeline stalls, this leads to the intuitive result that pipelining
can improve performance by the depth of pipeline.
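As a worked example (with hypothetical numbers of our own choosing), the following C snippet applies the last formula:

```c
#include <stdio.h>

/* Pipeline speedup when the unpipelined CPI equals the pipeline depth:
 * speedup = depth / (1 + stall cycles per instruction). */
double pipeline_speedup(int depth, double stalls_per_instr) {
    return depth / (1.0 + stalls_per_instr);
}

int main(void) {
    /* Assumed numbers: a 5-stage pipeline averaging 0.25 stall
     * cycles per instruction. */
    printf("speedup = %.2f\n", pipeline_speedup(5, 0.25));  /* 4.00 */
    return 0;
}
```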
Self Assessment Questions
23. A __________ causes the pipeline performance to degrade from the ideal performance.
24. CPI is the abbreviation for ___________ .
Activity 1:
Pick any two hazards from the organisation you previously visited. Now apply the handling techniques to these hazards.
4.10 Summary
Let us recapitulate the important concepts discussed in this unit:
• A parallel processing system is able to perform concurrent data processing to achieve faster execution time.
4.13 Answers
Self Assessment Questions
1. Pipelining
2. Virtual parallelism
3. Load Memory Data
4. First-in first-out (FIFO) buffer
5. Linear
6. Non-Linear
7. Dynamic pipelines
8. Hazards
9. Resource conflicts
10. Data dependency
11. Branch difficulties
12. True
13. 6
14. Control Hazards, data hazards
15. Taken, not taken or untaken
16. Forwarding
17. Bypassing or short-circuiting
18. Freeze or flush the pipeline
19. Assume each branch as not-taken
20. Load-stall count
21. LD
22. LDR
23. Stall
24. Cycles per Instruction
Terminal Questions
1. The concurrent use of two or more CPU or processors to execute a
program is called parallel processing. For details, refer Section 4.1.
2. An implementation technique by which the execution of multiple
instructions can be overlapped is called pipelining. Refer Section 4.2 for
more details.
3. There are two types of pipelining-Linear and non-linear. Refer Section 4.3
for more details.
4. Hazards are the situations that stop the next instruction in the instruction
stream from being executed during its designated clock cycle. Refer
Section 4.4.
5. There are two techniques to handle hazards namely minimising data
hazard stalls by forwarding and reducing pipeline branch penalties. Refer
Section 4.7.
References:
• Salomon, David (2008). Computer Organisation. NCC Blackwell.
• Hennessy, John L. & Patterson, David A. Computer Architecture: A Quantitative Approach (4th edition). Morgan Kaufmann.
• Dumas II, Joseph D. Computer Architecture. CRC Press.
• Carter, Nicholas P. Schaum's Outline of Computer Architecture. McGraw-Hill Professional.
E-references:
• http://www.lc3help.com/tutorials/Basic_LC-3_Instructions/ Retrieved on
03-04-2012
• http://www.scribd.com/doc/4596293/LC3-Instruction-Details Retrieved
on 02-04-2012
• http://xavier.perseguers.ch/programmation/mips-
assembler/references/5-stage-pipeline.html
5.1 Introduction
In the previous unit, you studied pipelined processors in great detail with a
short review of pipelining and examples of some pipeline in modern
processors. You also studied various kinds of pipeline hazards and the
techniques available to handle them.
In this unit, we will introduce you to the design space of pipelines. The day-by-day increasing complexity of chips has led to higher operating speeds. These speeds are achieved by overlapping instruction latencies, that is, by implementing pipelining. Early models used a discrete pipeline that performs its task in stages such as fetch, decode, execute, memory access, and write-back. Every pipeline stage requires one cycle, and as there are five stages, the instruction latency is five cycles. Longer pipelines can hide instruction latencies over more cycles, enabling processors to attain higher clock speeds. Instruction pipelining has significantly improved the performance of today's processors. In this unit, you will study the design space of pipelines, which is further divided into the basic layout of a pipeline and dependency resolution.
[Figure: Design space of pipelines - the number of stages, the specification of the subtasks to be performed in each of the stages, the layout of the stage sequence, the use of bypassing, and the timing of pipeline operations]
next instruction from the memory, decodes it, optimises the order of execution and then sends the instruction to its destinations.
3. Calculate operand address: Now, the effective address of each source
operand is calculated.
4. Fetch operand/memory access: Then, the memory is accessed to fetch
each operand. For a load instruction, data returns from memory and is
placed in the Load Memory Data (LMD) register. If it is a store, then data
from register is written into memory. In both cases, the operand address
as computed in the prior cycle is used.
5. Execute instruction: In this operation, the ALU performs the indicated operation on the operands prepared in the prior cycle and stores the result in the specified destination operand location.
6. Write back operand: Finally, the result is written into the register file or stored into the memory.
These six stages of instruction pipeline are shown in a flowchart in figure 5.4.
Arithmetic or logical shifts can be easily implemented with shift registers. High-
speed addition requires either the use of a carry-propagation adder (CPA)
which adds two numbers and produces an arithmetic sum, as shown in figure 5.6a, or the use of a carry-save adder (CSA) to "add" three input numbers and produce one sum output and a carry output, as exemplified in figure 5.6b.
[Figure 5.6: (a) An n-bit carry-propagate adder (CPA), which either allows carry propagation or applies the carry-lookahead technique; for example, with n = 4, A = 1011, B = 0111, S = 10010 = A + B. (b) An n-bit carry-save adder (CSA), where Sb is the bitwise sum of X, Y and Z, and C is a carry vector generated without carry propagation between digits, so that Sb + C = X + Y + Z.]
[Figure: the eight partial products P0 through P7 of an 8 x 8 multiplication are summed to produce the final product P.]
The first stage (S1) generates all eight partial products, ranging from 8 bits to
15 bits, simultaneously. The second stage (S2) is made up of two levels of four
CSAs, and it essentially merges eight numbers into four numbers ranging from
13 to 15 bits. The third stage (S3) consists of two CSAs, and it merges four
numbers from S2 into two 16-bit numbers. The final stage (S4) is a CPA, which
adds up the last two numbers to produce the final product P.
For a maximum width of 16 bits, the CPA is estimated to need four gate levels of delay.
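The CSA reduction step is easy to express in C. The sketch below is ours, not taken from any particular design; it merges three operands into a bitwise sum and a carry vector, which a final CPA (here, the ordinary + operator) then adds:

```c
#include <stdio.h>
#include <stdint.h>

/* One carry-save adder (CSA) step: reduce three operands to a bitwise
 * sum Sb and a carry vector C, with no carry propagation between digits. */
void csa(uint32_t x, uint32_t y, uint32_t z, uint32_t *sb, uint32_t *c) {
    *sb = x ^ y ^ z;                          /* per-bit sum              */
    *c  = ((x & y) | (y & z) | (z & x)) << 1; /* per-bit carries, shifted */
}

int main(void) {
    uint32_t sb, c;
    csa(11, 13, 6, &sb, &c);                  /* merge three numbers      */
    printf("%u + %u = %u\n", sb, c, sb + c);  /* 0 + 30 = 30 = 11+13+6    */
    return 0;
}
```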
Activity 2:
Access the internet and find out more about the difference between fixed point
and floating point units.
You can see that the rows show the time steps and the columns show the operations performed in each time step. In this PES, the branch "ble" is not taken and the processor speculatively executes instructions from the predicted path. In this example we have shown renamed values only for register r3, but other registers can also be renamed. The various values assigned to register r3 are bound to different physical registers (R1, R2, R3, R4).
Now let us look at several ways of organising the instruction issue buffer, in increasing order of complexity.
Single queue method: Renaming is not needed in the single queue method because this method has one queue and no out-of-order issue. In this method, operand availability can be managed through simple reservation bits allotted to every register. A register is reserved when an instruction that modifies it issues, and the bit is cleared when the modification completes.
Multiple queue method: In the multiple queue method, instructions issue from each queue in order, but the individual queues may issue out of order with respect to one another. The queues are organised according to instruction type.
Reservation stations: In reservation stations, the instruction issue does not follow FIFO order. As a result, the reservation stations must simultaneously monitor their source operands for data availability. The conventional way of doing this is to hold the operand data in the reservation station: when a reservation station receives an instruction, the available operand values are first read and placed in it.
The station then compares the operand designators of the unavailable data with the result designators of completing instructions. If there is a match, the result value is forwarded to the matching reservation station. An instruction is issued once all of its operands are ready in the reservation station. Reservation stations may be partitioned by instruction type to reduce data paths, or may behave as a single block.
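A reservation-station entry and its operand-monitoring rule can be sketched as a small C structure; the fields and functions below are illustrative assumptions of ours, not a description of any particular processor:

```c
#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

/* One reservation-station entry. The station watches result broadcasts;
 * when a completing instruction's result designator (tag) matches the
 * designator of a missing operand, the value is captured. */
typedef struct {
    bool     busy;                  /* entry holds an instruction       */
    bool     src1_ready, src2_ready;
    uint32_t src1_val,   src2_val;  /* operand values, once captured    */
    uint8_t  src1_tag,   src2_tag;  /* result designators being awaited */
} RSEntry;

/* Called on every result broadcast: compare designators and capture. */
void snoop_result(RSEntry *e, uint8_t tag, uint32_t value) {
    if (!e->busy) return;
    if (!e->src1_ready && e->src1_tag == tag) { e->src1_val = value; e->src1_ready = true; }
    if (!e->src2_ready && e->src2_tag == tag) { e->src2_val = value; e->src2_ready = true; }
}

/* An instruction may issue, in any order rather than FIFO, once both
 * of its operands are ready in the reservation station. */
bool can_issue(const RSEntry *e) {
    return e->busy && e->src1_ready && e->src2_ready;
}

int main(void) {
    RSEntry e = { .busy = true,
                  .src1_ready = true,  .src1_val = 7,
                  .src2_ready = false, .src2_tag = 3 };
    snoop_result(&e, 3, 35);                 /* result with tag 3 completes */
    printf("issue? %d\n", can_issue(&e));    /* 1: both operands now ready  */
    return 0;
}
```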
Self Assessment Questions
9. In traditional pipeline implementations, load and store instructions are
processed by the ___________________ .
10. The consistency of instruction completion with that of sequential instruction execution is specified by ______________.
11. A processor which endorses weak memory consistency does not allow reordering of memory accesses. (True/False)
12. ____________ is not needed in single queue method.
13. In reservation stations, the instruction issue does not follow the FIFO
order. (True/ False).
5.6 Summary
• The design space of pipelines can be subdivided into two aspects: the basic layout of a pipeline and dependency resolution.
• An Instruction pipeline operates on a stream of instructions by
overlapping and decomposing the three phases (fetch, decode and
execute) of the instruction cycle.
• Two basic aspects of the design space are how FX pipelines are laid out
logically and how they are implemented.
• A logical layout of an FX pipeline consists, first, of the specification of how
many stages an FX pipeline has and what tasks are to be performed in
these stages.
• The other key aspect of the design space is how FX pipelines are implemented.
• In logical layout of FX pipelines, the FX pipelines for RISC and CISC
processors have to be taken separately, since each type has a slightly
different scope.
• Pipelined processing of loads and stores covers the sequential consistency
of instruction execution and parallel execution.
5.7 Glossary
• CISC: It is an acronym for Complex Instruction Set Computer. The CISC
machines are easy to program and make efficient use of memory.
• CPA: It stands for carry-propagation adder which adds two numbers
and produces an arithmetic sum.
• CSA: It stands for carry-save adder which adds three input numbers
and produces one sum output.
• LMD: Load Memory Data.
• Load/Store bypassing: It defines that either loads can bypass stores or
vice versa, without violating the memory data dependencies.
• Memory consistency: It is used to find out whether memory access is
performed in the same order as in a sequential processor.
• Processor consistency: It is used to indicate the consistency of
instruction completion with that of sequential instruction execution.
• RISC: It stands for Reduced Instruction Set Computing. RISC
computers reduce chip complexity by using simpler instructions.
• ROB: It stands for Reorder Buffer. The ROB is a mechanism for assuring
sequentially consistent execution when multiple EUs operate in parallel.
• Speculative loads: They avoid memory access delay, which can be caused
by required addresses not yet being computed or by clashes among the
addresses.
• Tomasulo’s algorithm: It allows the replacement of sequential order by
data-flow order.
5.9 Answers
Self Assessment Questions
1. Microprocessor without Interlocked Pipeline Stages
2. Dynamically
3. Write Back Operand
4. Opcode, operand specifiers
5. Register operands
6. True
Terminal Questions
1. The design space of pipelines can be subdivided into two aspects: basic
layout of a pipeline and dependency resolution. Refer Section 5.2.
2. A pipeline instruction processing technique is used to increase the
instruction throughput. It is used in the design of modern CPUs,
microcontrollers and microprocessors. Refer Section 5.3 for more details.
3. There are two basic aspects of the design space of pipelined execution of
Integer and Boolean instructions: how FX pipelines are laid out logically
and how they are implemented. Refer Section 5.4.
4. While processing operates instructions, RISC pipelines have to cope only
with register operands. By contrast, CISC pipelines must be able to deal
with both register and memory operands as well as destinations. Refer
Section 5.4.
5. Depending on the function to be implemented, different pipeline stages in
an arithmetic unit require different hardware logic. Refer Section 5.4.
6. The execution of load and store instructions begins with the
determination of the effective memory address (EA) from where data is to
be fetched. This can be broken down into subtasks. Refer
Section 5.5.
7. The overall instruction execution of a processor should mimic sequential
execution, i.e. it should preserve sequential consistency. The first step is to
create and buffer execution tuples, and then determine which tuples can be
issued for parallel execution. Refer Section 5.5.
References:
• Hwang, K. (1993) Advanced Computer Architecture. McGraw-Hill.
• Godse D. A. & Godse A. P. (2010). Computer Organisation, Technical
Publications. pp. 3-9.
• Hennessy, John L., Patterson, David A. & Goldberg, David (2002)
Computer Architecture: A Quantitative Approach, (3rd edition), Morgan Kaufmann.
• Sima, Dezso, Fountain, Terry J. & Kacsuk, Peter (1997) Advanced
computer architectures - a design space approach, Addison-Wesley-
Longman: I-XXIII, 1-766.
E-references:
• http://www.eecg.toronto.edu/~moshovos/ACA06/readings/ieee-
proc.superscalar.pdf
• http://webcache.googleusercontent.com/search?q=cache:yU5nCVnju9
cJ:www.ic.uff.br/~vefr/teaching/lectnotes/AP1-topico3.5.ps.gz+load+
store+sequential+instructions&cd=2&hl=en&ct=clnk&gl=in
Structure:
6.1 Introduction
Objectives
6.2 Dynamic Scheduling
Advantages of dynamic scheduling
Limitations of dynamic scheduling
6.3 Overcoming Data Hazards
6.4 Dynamic Scheduling Algorithm - The Tomasulo Approach
6.5 High performance Instruction Delivery
Branch target buffer
Advantages of branch target buffer
6.6 Hardware-based Speculation
6.7 Summary
6.8 Glossary
6.9 Terminal Questions
6.10 Answers
6.1 Introduction
In pipelining, two or more instructions that are independent of each other can
overlap. This possibility of overlap is known as ILP (instruction-level
parallelism), because the instructions may be evaluated in parallel. The level of
parallelism is quite small in straight-line code, where there are no branches
except at entry and exit. The easiest and most widely used methodology to
enhance parallelism is to exploit parallelism among the iterations of a loop,
known as loop-level parallelism.
Here F0, F1, F2, ..., F14 are the floating-point registers (FPRs) and DIVD, ADDD
and SUBD are the floating-point operations on double precision (denoted by D).
The dependence of ADDD on DIVD causes a stall in the pipeline, and thus the
SUBD instruction cannot execute. If the instructions did not have to execute in
the same sequence, this limitation could be ruled out.
In the case of the DLX pipeline (DLX is a RISC processor architecture), the
structural and data hazards are examined during instruction decode (ID). If an
instruction can execute properly, it is issued from ID. To commence the
execution of the SUBD, we need to examine the following two issues
separately:
• Firstly, we need to check for structural hazards of any type.
• Secondly, we need to wait for the absence of any data hazard.
In this example you can see that ADDD and SUBD are interdependent. If
SUBD is executed before ADDD, the data dependence will be violated,
resulting in wrong execution. Similarly, to avoid violating output dependences,
it is essential to detect WAW (Write After Write) hazards.
The scoreboard technique helps to minimise or remove both the structural and
the data hazards. The scoreboard stalls the later instruction that is involved in
the dependence. The scoreboard's goal is to execute an instruction in each
clock cycle (when no structural hazards exist); therefore, when one instruction
is stalled, other independent instructions may still be executed. The scoreboard
takes complete responsibility for issuing and executing the instructions,
together with all hazard detection. Taking advantage of out-of-order execution
necessarily requires several instructions to be in execution simultaneously. We
can achieve this by use of either of the two ways:
1. By utilizing pipelined functional units
2. By using multiple functional units
Both of these ways complicate pipeline control. Here we will consider
the use of multiple functional units.
The CDC 6600 comprises 16 distinct functional units. These are of the following
types:
• Four FPUs (floating-point units)
• Five units for memory references
• Seven units for integer operations.
In the DLX scoreboard, the FPUs are of prime importance in comparison to the
other functional units (FUs). For example, assume we have 2 multipliers, 1 adder,
1 divide unit, and 1 integer unit for all integer operations, memory references
and branches.
The methodology for the DLX & CDC 6600 is quite similar as both of these are
load-store architectures. Given below in figure 6.1 is the basic structure of a
DLX Processor with a Scoreboard.
Now let us study the four steps in the scoreboard technique in detail.
1. Issue: The issue step replaces part of the ID step of the DLX pipeline. In
this step the instruction is forwarded to the FU and the internal data
structures are updated. This is done only when two conditions hold:
• The FU for the instruction is free.
• No other active instruction has the same register as destination. This
ensures that the operation is free from WAW (Write After Write) hazards.
When a structural or WAW hazard is detected, the instruction issue stalls,
and no subsequent instruction is issued until the hazard has been cleared.
When a stall occurs in this stage, the buffer between instruction fetch and
issue fills. If the buffer holds a single instruction, instruction fetch stalls at
once; if the buffer is a queue, fetch stalls only after the queue is full.
2. Read operands: The scoreboard examines whether the source operands
are available. A source operand is available when no previously issued
active instruction is going to write it. When the source operands become
available, the scoreboard prompts the FU to read the operands from the
data registers and begin execution. Read After Write (RAW) hazards are
resolved dynamically in this stage, and instructions may be sent for
out-of-order execution. The issue and read-operands steps together
complete the functions of the ID step of the DLX pipeline.
3. Execution: After receiving the operands, the FU starts execution. On
completion, the result is generated and the FU informs the scoreboard
that execution has finished. The execution step takes the place of the EX
step of the DLX pipeline, but it may occupy multiple cycles.
4. Write result: After the FU completes execution, the scoreboard checks
whether WAR (Write After Read) hazards are present. If a WAR hazard is
detected, it stalls the instruction. A WAR hazard occurs in code like our
earlier example of ADDD and SUBD, where both utilise F8. The code for
that example is again shown below:
Here you can see that the source operand of ADDD is F8, which is also the
destination register of SUBD. However, ADDD is in fact dependent on the
earlier instruction DIVD. In this case, the scoreboard will stall SUBD in its
write-result stage until ADDD has read its operands. In general, a completing
instruction is not permitted to write its result when:
• there is an instruction that has not yet read its operands and that precedes
(in issue order) the completing instruction, and
• one of those operands is the same register as the result of the completing
instruction.
To keep track of all this, the scoreboard maintains bookkeeping tables,
including an instruction status table and a functional unit status table with the
fields: Name, Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk.
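As a rough illustration of this bookkeeping, here is a Python sketch of one functional unit status entry. The field meanings are the conventional ones (Fi: destination register; Fj, Fk: source registers; Qj, Qk: units producing the sources; Rj, Rk: flags marking the sources ready); the class itself is illustrative, not an actual implementation.

from dataclasses import dataclass

@dataclass
class FUStatus:
    """One functional-unit entry in the scoreboard's status table."""
    name: str        # functional unit name, e.g. "adder"
    busy: bool = False
    op: str = ""     # operation being performed, e.g. "SUBD"
    Fi: str = ""     # destination register
    Fj: str = ""     # first source register
    Fk: str = ""     # second source register
    Qj: str = ""     # unit producing Fj ("" if value is ready)
    Qk: str = ""     # unit producing Fk ("" if value is ready)
    Rj: bool = True  # Fj ready and not yet read
    Rk: bool = True  # Fk ready and not yet read

    def can_read_operands(self):
        # Read-operands step: both sources must be ready (RAW resolved).
        return self.busy and self.Rj and self.Rk

# A subtract dispatched to the adder while the divider still computes F6:
sub = FUStatus("adder", busy=True, op="SUBD",
               Fi="F8", Fj="F6", Fk="F2", Qj="divider", Rj=False)
print(sub.can_read_operands())  # False: waiting for the divider to write F6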
Activity 1:
Imagine yourself as a computer architect. Explain the measures you will take
to overcome data hazards with dynamic scheduling.
Assuming that:
• Misprediction penalty = 4 cycles
• Buffer miss-penalty = 3 cycles
• Hit rate and accuracy each = 90%
• Branch Frequency = 15%
Solution:
The speedup with a Branch Target Buffer versus no BTB is expressed as:
Speedup = CPI_noBTB / CPI_BTB
= (CPI_base + Stalls_noBTB) / (CPI_base + Stalls_BTB)
The stalls are determined as:
Stalls = Σ (Frequency x Penalty)
that is, the sum over all the stall cases of the product of the frequency of the
stall case and its stall penalty.
i) Stalls_noBTB = 0.15 x 2 = 0.30
ii) To find Stalls_BTB, we have to consider each outcome from the BTB.
There exist three possibilities:
a) Branch misses the BTB:
Frequency = 15% x 0.1 = 1.5% = 0.015
Penalty = 3
Stalls = 0.045
b) Branch hits and is correctly predicted:
Frequency = 15% x 0.9 (hit) x 0.9 (correct prediction) = 12.1% = 0.121
Penalty = 0
Stalls = 0
c) Branch hits but is incorrectly predicted:
Frequency = 15% x 0.9 (hit) x 0.1 (misprediction) = 1.3% = 0.013
Penalty = 4
Stalls = 0.052
iii) Stalls_BTB = 0.045 + 0 + 0.052 = 0.097
Speedup = (CPI_base + Stalls_noBTB) / (CPI_base + Stalls_BTB)
= (1.0 + 0.3) / (1.0 + 0.097)
= 1.2
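The same arithmetic can be scripted. This short Python snippet simply reproduces the calculation above; the 2-cycle no-BTB penalty and the base CPI of 1.0 come from the worked example itself.

# Reproduce the BTB speedup arithmetic from the worked example above.
branch_freq = 0.15
hit_rate = 0.90          # BTB hit rate
accuracy = 0.90          # prediction accuracy on a BTB hit
cpi_base = 1.0

# Without a BTB: every branch pays the 2-cycle penalty used above.
stalls_no_btb = branch_freq * 2                      # 0.30

# With a BTB: sum (frequency x penalty) over the three outcomes.
miss = branch_freq * (1 - hit_rate) * 3              # miss in BTB, penalty 3
correct = branch_freq * hit_rate * accuracy * 0      # hit, correctly predicted
wrong = branch_freq * hit_rate * (1 - accuracy) * 4  # hit, mispredicted
stalls_btb = miss + correct + wrong                  # 0.099 without rounding

speedup = (cpi_base + stalls_no_btb) / (cpi_base + stalls_btb)
print(round(speedup, 2))  # 1.18; the example rounds intermediates to get 1.2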
In order to achieve a higher rate of instruction delivery, one possible variation
on the Branch Target Buffer is:
• to store one or more target instructions, instead of, or in addition to, the
predicted target address.
6.5.2 Advantages of branch target buffer
There are several advantages of branch target buffer. They are as follows:
• It potentially allows a larger BTB, since the access can take more time
between consecutive instruction fetches.
• Buffering the actual target instructions allows branch folding, i.e., zero-cycle
unconditional branching and sometimes zero-cycle conditional branching.
Self Assessment Questions
12. The branch-prediction buffer is accessed during the _____ stage.
13. The _____ field helps check the addresses of the known branch
instructions.
14. Buffering the actual Target-Instructions allow ___________ .
6.7 Summary
Let us recapitulate the important concepts discussed in this unit:
• In pipelining, the execution of instructions that are independent of one
another can overlap. This possible overlap is known as instruction-level
parallelism (ILP).
• Pipeline fetches an instruction and executes it.
• In DLX pipelining, all the structural & data hazards are analyzed
throughout the process of instruction decode (ID).
• Dynamic scheduling is hardware-based scheduling. In this approach, the
hardware rearranges the instruction execution to reduce the stalls.
6.8 Glossary
• Dynamic scheduling: Hardware based scheduling that rearranges the
instruction execution to reduce the stalls.
• EX: Execution stage
• FP: Floating-Point Unit
• ID: Instruction Decode
• ILP: Instruction-Level Parallelism
• Instruction-level parallelism: Overlap of independent instructions on one
another
• Static scheduling: Separating dependent instructions and minimising the
number of actual hazards and resultant stalls.
6.9 Terminal Questions
1. What do you understand by instruction-level parallelism? Also, explain
loop-level parallelism.
2. Describe the concept of dynamic scheduling.
3. How does the execution of instructions take place under dynamic
scheduling with scoreboarding?
4. What is the goal of scoreboarding?
5. Explain the Tomasulo approach.
6.10 Answers
Self Assessment Questions
1. Static scheduling
2. Check the structural hazards, wait for the absence of data hazards
3. An instruction fetch
4. EX
5. SUBD, ADDD
6. Pipelined, multiple
References:
• John L. Hennessy and David A. Patterson, Computer Architecture: A
Quantitative Approach, Fourth Edition, Morgan Kaufmann Publishers.
• David Salomon, Computer Organisation, 2008, NCC Blackwell.
• Joseph D. Dumas II; Computer Architecture; CRC Press.
• Nicholas P. Carter; Schaum’s Outline of Computer Architecture; McGraw-Hill
Professional.
7.1 Introduction
In the previous unit, you studied Instruction-level parallelism and its dynamic
exploitation. You learnt how to overcome data hazards with dynamic
scheduling besides performance instruction delivery and hardware based
speculation.
As mentioned in the previous unit, an inherent property of a sequence of
instructions may allow some of them to be executed in parallel; this is known
as instruction-level parallelism (ILP). There is an upper bound on how much
parallelism can be achieved. We can approach this upper bound via a series of
transformations that either exploit ILP or expose more ILP to later
transformations. The best way to exploit ILP is to have a
Objectives:
After studying this unit, you should be able to:
• identify the various types of branches
• explain the concept of branch handling
• describe the role of delayed branching
• recognise branch processing
• discuss the process of branch prediction
• explain Intel IA-64 architecture and Itanium processor
• discuss the use of ILP in the embedded and mobile markets
The conditional branch instruction given above tests the contents of two
registers, Rsrc1 and Rsrc2, for equality. If their values are equal, control is
transferred to the target. Let us suppose that the numbers to be compared are
placed in registers t0 and t1. For this, the branch instruction is written as
below:
beq $t1,$t0,target
The instruction given above replaces the two-instruction cmp/je sequence
used by the Pentium.
Some processors maintain registers for recording the condition of arithmetic
and logical operations. We call these registers condition code registers.
These registers record the status of the last arithmetic or logical operation.
For instance, if two 32-bit integers are added, the sum might need more than
32 bits. This is an overflow condition, which should be recorded by the system;
usually it is indicated by setting a bit in the condition code register. MIPS, for
example, does not use condition code registers; rather, it uses exceptions to
flag the overflow condition. Alternatively, processors such as the Pentium,
SPARC and PowerPC record such conditions in condition code registers.
Activity 1:
Work on an MIPS processor to find out the difference between conditional
and unconditional branching.
target:
mult R8, R9, R10
...
Moving instructions into delay slots is not a worry for programmers; this task
is accomplished by compilers and assemblers. If no useful instruction can be
moved into the delay slot, a NOP (no operation) is placed there. Note that if
the branch is not taken, we may not want the delay-slot instruction to execute;
that is, we may want to nullify the instruction in the delay slot. A number of
processors such as SPARC offer this option of nullification.
Self Assessment Questions
5. A number of processors such as __________ and _________ make
use of delayed execution for procedure calls as well as branching.
6. If any valuable instruction cannot be moved into the delay slot, __________ is placed.
The data in the table given above presume that conditional branches are not
taken approximately 60% of the time, so this "not taken" prediction for
conditional branches is accurate only about sixty percent of the time. Since
conditional branches account for 42% of the cases in the table, we get the
following:
42 x 0.6 = 25.2%
This is the contribution of conditional branches to the prediction accuracy.
Likewise, loop branches jump back with 90% probability. As loops appear
about 10% of the time, they contribute about 9% to the prediction accuracy.
Surprisingly, even this static prediction approach provides an overall accuracy
of about 82%.
7.6.3 Dynamic branch prediction
To make more accurate predictions, this approach considers run-time history.
The outcomes of the last n branch executions are recorded, and this
information is used to predict the next one. The empirical study by Smith and
Lee suggests that this approach gives a major improvement in prediction
accuracy. Table 7.2 summarises their findings.
Table 7.2: Effect of Utilising the Information of Past Branches on Prediction
Accuracy
The algorithm applied is simple: the next prediction is the majority outcome of
the past n branch executions. For instance, suppose n = 3; if the branch was
taken two or more times in the past three executions, the prediction is that
the branch will be taken.
In table 7.2, the data suggest that if we consider the last two branch executions of
In the figure given above, the left bit signifies the prediction, whereas the right
bit signifies the actual outcome of the branch (that is, whether the branch was
taken or not). If the left bit is "0", the prediction is "not taken"; otherwise the
prediction is "branch taken". The right bit records the actual outcome of the
branch instruction: "0" signifies "branch not taken" (the branch did not jump),
while "1" signifies "branch taken". For instance, state 00 signifies that the
prediction was "not taken" (left zero bit) and the branch was indeed not taken
(right zero bit). Thus we stay in state 00 while the branch is not taken. If the
prediction is incorrect, we move to state 01, but "branch not taken" is still
predicted, since we were wrong only once. If the next prediction is right, we
move back to state 00. If the prediction turns out to be incorrect again, we
change the prediction to "branch taken" and move to state 10. Thus, two
wrong predictions in a row make us change the prediction.
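A minimal Python sketch of this two-bit scheme follows. It uses a saturating counter, which is one common encoding consistent with the description above (two successive wrong guesses flip the prediction); the class and names are illustrative.

class TwoBitPredictor:
    """Two-bit branch predictor as described above.

    States 00 and 01 predict 'not taken'; states 10 and 11 predict
    'taken'. Two consecutive mispredictions flip the prediction.
    """
    def __init__(self):
        self.state = 0b00  # start: strongly predict not taken

    def predict(self):
        return self.state >= 0b10  # left bit set -> predict taken

    def update(self, taken):
        # Saturating counter: move toward 11 on taken, toward 00 otherwise.
        if taken:
            self.state = min(self.state + 1, 0b11)
        else:
            self.state = max(self.state - 1, 0b00)

# A branch taken five times, not taken once, then taken three more times:
p = TwoBitPredictor()
correct = 0
for taken in [True] * 5 + [False] + [True] * 3:
    if p.predict() == taken:
        correct += 1
    p.update(taken)
print(correct, "of 9 predicted correctly")  # 6 of 9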
Operation category | Examples of operations | Number | Comment
Load/store | ld8, ld16, ld32, imm, st8, st16, st32 | 33 | signed, unsigned; register indirect, indexed, scaled addressing
Byte shuffles | shift right 1, 2, 3 bytes, select byte, merge, pack | | SIMD type convert
Bit shifts | asl, asr, lsl, lsr, rol | 10 | round, saturate, 2's complement, SIMD
Multiplies and multimedia | mul, sum of products, sum-of-SIMD-elements, e.g. sum of products (FIR) | 23 | SIMD
Integer arithmetic | add, sub, min, max, abs, average, bitand, bitor, bitxor, bitinv, bitandinv, eql, neq, gtr, geq, les, leq, sign extend, zero extend, sum of absolute differences | 62 | saturate, 2's complement, unsigned, immediate, SIMD
Floating point | add, sub, neg, mul, div, sqrt, eql, neq, gtr, geq, les, leq | 42 | scalar, IEEE flags
Special ops | alloc, prefetch, copy back, read tag, read cache status, read counter | 20 | cache, special registers
Branch | jmpt, jmpf | 6 | (un)interruptible
Total | | 207 |
Figure 7.7: Operations found in Trimedia TM32 CPU
One characteristic that is unusual from the desktop point of view is that the
programmer is allowed to specify five independent operations to be issued
simultaneously. If five independent instructions are not available (that is, the
others are dependent), NOPs (no operations) are placed in the remaining
slots. We call this method of instruction coding VLIW (Very Long Instruction
Word).
Because the Trimedia TM32 CPU has long instruction words that frequently
include NOPs, Trimedia instructions are stored compressed in memory and
are decoded to full size only when they are loaded into the cache. Figure 7.8
shows the TM32 CPU instruction mix for the EEMBC benchmarks.
Figure 7.8: TM32 CPU Instruction Mix for EEMBC Customer Benchmark
With unmodified source code, the instruction mix is similar to that of
general-purpose computers, although there are more byte data transfers. The
large number of pack and merge instructions is observed because they align
data for the SIMD instructions. Thus the instruction mix for "out-of-the-box" C
code resembles that of general-purpose computers (with a higher share of
byte data transfers), while the SIMD instructions, along with pack and merge,
are used by the hand-optimised code.
7.9 Summary
• Branching is implemented by using a branch instruction, which includes
the address of the target instruction.
• The branch penalty can be reduced to one cycle. It can be efficiently
reduced further by means of Delayed branch execution.
• Effective processing of branches has become a cornerstone of increased
performance in ILP-processors.
• Branch prediction is a method which is basically utilised for handling the
problems related to branch. Different strategies of branch prediction
include:
❖ Fixed branch prediction
❖ Static branch prediction
❖ Dynamic branch prediction
• The new architecture, developed jointly by Hewlett-Packard and Intel, is
known as IA-64.
• The IA-64 model is also known as Explicitly Parallel Instruction Computing
(EPIC).
• Itanium is a family of 64-bit Intel microprocessors that implement the Intel
Itanium architecture. This architecture was initially known as IA-64.
• The Crusoe and Trimedia chips represent interesting strategies for
applying the concepts of Very Long Instruction Word (VLIW) in the
embedded space. The Trimedia processor may be the closest existing
processor to a "classic" VLIW processor.
7.10 Glossary
References:
• Hwang, K. Advanced Computer Architecture. McGraw-Hill.
• Godse, D. A. & Godse, A. P. Computer Organization. Technical
Publications.
• Hennessy, John L., Patterson, David A. & Goldberg, David. Computer
Architecture: A Quantitative Approach. Morgan Kaufmann.
E-references:
• http://www.scribd.com/doc/46312470/37/Branch-processing,
• http://www.scribd.com/doc/60519412/15/Another-View-The-Trimedia-
TM32-CPU-151.
Structure:
8.1 Introduction
Objectives
8.2 Memory Hierarchy
Cache memory organisation
Basic operation of cache memory
Performance of cache memory
8.3 Cache Addressing Modes
Physical address mode
Virtual address mode
8.4 Mapping
Direct mapping
Associative mapping
8.5 Elements of Cache Design
8.6 Cache Performance
Improving cache performance
Techniques to reduce cache miss
Techniques to decrease cache miss penalty
Techniques to decrease cache hit time
8.7 Shared Memory organisation
8.8 Interleaved Memory Organisation
8.9 Bandwidth and Fault Tolerance
8.10 Consistency Models
Strong consistency models
Weak consistency models
8.11 Summary
8.12 Glossary
8.1 Introduction
You can say that the memory system is an important part of a computer
system. The input data, the instructions necessary to manipulate the input
data and the output data are all stored in the memory.
Now let us discuss cache memory and the cache memory organisation.
8.2.1 Cache memory organisation
A cache memory is an intermediate memory between two memories that have
a large difference between their speeds of operation. Cache memory is
located between main memory and the CPU to hold the most frequently used
data and instructions. Communicating through a cache memory in between
enhances the performance of a system significantly.
Locality of reference is the common observation that, over a particular time
interval, memory references are confined to a few localised areas of memory.
This can be illustrated with a control structure such as a loop: whenever a loop
is executed in a program, the CPU executes the loop repeatedly, so loops and
subroutines provide locality of reference for instruction fetches. Data
references are similarly localised; a table look-up procedure, for instance,
repeatedly refers to the portion of memory in which the table is stored. Cache
memories exploit this property: by keeping the most frequently accessed
instructions and data in the fast cache memory, the average memory access
time approaches the access time of the cache.
8.2.2 Basic operation of cache memory
Whenever the CPU needs to access memory, the cache is examined first. If
the word is found in the cache, it is read from the fast memory. If the word is
missing from the cache, main memory is accessed to read it, and a set of
words including the one just accessed is then transferred from main memory
to cache memory.
Cache Hit: When the addressed data or instruction is found in the cache
during operation, it is called a cache hit.
Cache Miss: When the addressed data or instruction is not found in the cache
during operation, it is called a cache miss. On a cache miss, a complete cache
block is loaded at one time from the corresponding memory location.
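The benefit of a high hit rate can be quantified with the standard average-access-time formula t_avg = h x t_cache + (1 - h) x t_main (not stated explicitly above, but implied by the discussion). A quick Python sketch with assumed illustrative timings:

def avg_access_time(hit_rate, t_cache_ns, t_main_ns):
    # Average memory access time: hits are served by the cache,
    # misses fall through to main memory.
    return hit_rate * t_cache_ns + (1 - hit_rate) * t_main_ns

# Assumed illustrative timings: 2 ns cache, 60 ns main memory.
for h in (0.80, 0.95, 0.99):
    print(f"hit rate {h:.2f}: {avg_access_time(h, 2, 60):.1f} ns")
# As the hit rate approaches 1, t_avg approaches the cache access time.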
Implementation on Split Cache: When physical address is used in split
cache, both data cache and instruction cache are accessed with a physical
address after translation from MMU. In this design, the first-level D-cache uses
write-through policy as it is a small one (64 KB) and the second-level D-cache
uses write-back policy as it is larger (256 KB) with slower speed. The I-cache
is a single level cache that has a smaller size (64 KB). The implementation of
physical address mode on split cache is illustrated in figure 8.3.
In figure 8.4 you can see a unified cache that is accessed directly with the
virtual address; it is known as a virtual address cache. In the figure you can
also see that MMU (Memory Management Unit) translation and cache lookup
are performed simultaneously. The cache lookup does not use the physical
address produced by the MMU, but that address can be saved for later use.
The motivation for the virtual address cache is the improved efficiency of
quick cache access, since the lookup is overlapped with the MMU translation.
Advantages of virtual address mode: The virtual mode of cache addressing
offers the following advantages:
• It eliminates address translation time on a cache hit, and hits are more
common than misses.
• Cache lookup is not delayed.
8.4 Mapping
Mapping refers to the translation of a main memory address to a cache
memory address. The transfer of information from main memory to cache
memory is conducted in units of cache blocks or cache lines. Blocks in the
cache are called block frames, denoted
Bi for i = 1, 2, ..., j
where j is the total number of block frames in the cache. The corresponding
memory blocks are denoted
Bm for m = 1, 2, ..., k
where k is the total number of blocks in memory. It is assumed that
k >> j, k = 2^s and j = 2^r
where s is the number of bits required to address a main memory block, and
r is the number of bits required to address a cache memory block.
There are four types of mapping schemes: direct mapping, associative
mapping, set-associative mapping and sequential mapping. Here, we will
discuss the first two types.
8.4.1 Direct mapping
Associative memories are very costly compared to RAM because of the
additional logic associated with each cell. Suppose there are 2^j words in main
memory and 2^k words in cache memory. The j-bit memory address is divided
into two fields: k bits for the index field, and the remaining j-k bits for the tag
field. The direct-mapping cache organisation uses the k-bit index to access
the cache memory and the full j-bit address for main memory. Each cache
word contains data and the associated tag. In direct mapping, every memory
block is assigned to a particular line of the cache; if a line already holds a
memory block when a new block is to be loaded, the old block is removed.
Figure 8.5 illustrates this organisation.
Tag bits are stored alongside data bits as a new word enters the cache. When
the processor produces a memory request, the index field of the main memory
address is used to access the cache. The tag of the word in the cache is
compared with the tag field of the processor address. If the comparison is
positive, there is a hit and the word is read from the cache. If the comparison
is negative, it is a miss and the word is read from main memory; the word is
then stored in the cache with the new tag, replacing the previous value.
Demerits of direct mapping: If two words with the same index but dissimilar
tags are accessed repeatedly in alternation, the hit ratio falls substantially.
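The index/tag splitting described above can be sketched in a few lines of Python. The cache is word-addressed with k index bits; the class, sizes and names are illustrative only.

class DirectMappedCache:
    """Word-addressed direct-mapped cache: k index bits, (j-k) tag bits."""
    def __init__(self, k):
        self.k = k
        self.lines = {}  # index -> (tag, data)

    def access(self, address, memory):
        index = address & ((1 << self.k) - 1)  # low k bits select the line
        tag = address >> self.k                # remaining bits form the tag
        line = self.lines.get(index)
        if line is not None and line[0] == tag:
            return line[1], "hit"
        # Miss: read the word from main memory and replace the old line,
        # storing the new tag alongside the data.
        data = memory[address]
        self.lines[index] = (tag, data)
        return data, "miss"

memory = {addr: addr * 10 for addr in range(64)}
cache = DirectMappedCache(k=3)    # 8-line cache
print(cache.access(9, memory))    # miss: line 1, tag 1
print(cache.access(9, memory))    # hit
print(cache.access(17, memory))   # miss: same line 1, tag 2 evicts tag 1

The last access illustrates the demerit noted above: addresses 9 and 17 share an index, so alternating between them would defeat the cache.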
8.4.2 Associative mapping
Associative mapping is the quickest and most flexible cache organisation.
Both the address of a word and the content of the word are stored in the
associative memory, which means the cache can hold any word from main
memory.
For example, in figure 8.6, the CPU address is first placed in the argument
register and then the associative memory is searched for a match with that
address.
Activity 1:
Visit an organisation and find out the cache memory size and the costs they
are incurring to hold it. Also try to retrieve the size of the data stored in the
cache memory.
Figure 8.7: Shared Memory Organisation
As long as the processor requires a single memory read at a time, the above
memory arrangement with a single MAR, a single MDR, a single Address bus
and a single Data bus is sufficient. However, if more than one read is required
simultaneously, the arrangement fails. This problem can be overcome by
adding as many address and data bus pairs as needed, along with the
respective MARs and MDRs. But buses are expensive, as an equal number of
bus controllers is required to carry out the simultaneous reads.
An alternative technique to handle simultaneous reads with comparatively low
cost overheads is memory interleaving. Under this scheme, the memory is
divided into a number of modules equal to the number of simultaneous reads
required, each with its own MAR and MDR but sharing common data and
address buses. For example, if an instruction pipeline processor requires two
simultaneous reads at a time, the memory is partitioned into two modules with
two MARs and two MDRs, as shown in figure 8.10.
With m-way interleaving, the average time t1 to access each of n consecutive
words is:
t1 = (θ/m) x (1 + (m - 1)/n)
where θ is the memory cycle time. When n >> m (a very long vector),
t1 → θ/m = τ, the minor cycle. As n → 1 (scalar access), t1 → θ.
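A quick numerical check of this formula in Python, with θ and m chosen purely for illustration:

def avg_word_time(theta, m, n):
    # t1 = (theta / m) * (1 + (m - 1) / n): average time per word when
    # accessing n consecutive words from m-way interleaved memory.
    return (theta / m) * (1 + (m - 1) / n)

theta, m = 80.0, 8  # assumed 80 ns memory cycle, 8 modules
print(avg_word_time(theta, m, 1))     # scalar access: t1 = theta = 80 ns
print(avg_word_time(theta, m, 1000))  # long vector: t1 -> theta/m = 10 ns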
Fault Tolerance: High-order and low-order interleaving can be mixed to
generate various interleaved memory organisations. In a high-order
interleaved memory, sequential addresses are assigned within each memory
module. This makes it easier to isolate faulty memory modules in a memory
bank of m modules: when one module failure is detected, the remaining
modules can still be used by opening a window in the address space. Such
fault isolation cannot be carried out in a low-order interleaved memory, in
which a module failure may paralyse the entire memory bank.
8.12 Glossary
• Associative mapping: The quickest and most flexible cache organisation,
in which both the address and the content of each word are stored.
• Auxiliary memory: Memory that provides backup storage; it is not directly
accessible by the CPU but is connected with main memory.
• Cache memory organisation: A small, fast and costly memory that is
placed between a processor and main memory.
• Main memory: Refers to physical memory that is internal to the computer.
• Memory interleaving: A category of techniques for increasing memory
speed.
• NUMA multiprocessing: Non-Uniform Memory Access multiprocessing.
• RAM: Random-access memory
• Split cache design: A design where instructions and data are stored in
different caches for execution convenience.
8.14 Answers
Self Assessment Questions
1. Main memory
2. Auxiliary memory
3. Cache
4. Hit
5. Unified
6. Translation Lookside Buffer
7. Mapping
8. Associative
9. Split
10. Instruction Cache
11. Hit time
12. Miss Penalty
13. Instruction-Level Parallelism
14. Thread-Level Parallelism
15. Interleaved Memory Organisation
16. MARs and MDRs
17. m, 1
18. Sequential addresses
19. Consistency
20. Strong, Weak
Terminal Questions
1. Memory hierarchy contains the Cache Memory Organisation. Refer
References:
• Kai Hwang: Advanced Computer Architecture: Parallelism, Scalability,
Programmability - MGH
• Michael J. Flynn: Computer Architecture: Pipelined & Parallel Processor
Design - Narosa.
• J. P. Hayes: Computer Architecture & Organisation - MGH
• Nicholas P. Carter; Schaum’s Outline of Computer Architecture; McGraw-Hill
Professional
E-references:
• www.csbdu.in/
• www.cs.hmc.edu/
• www.usenix.org
• cse.yeditepe.edu.tr/
Structure:
9.1 Introduction
Objectives
9.2 Use and Effectiveness of Vector Processors
9.3 Types of Vector Processing
Memory-memory vector architecture
Vector register architecture
9.4 Vector Length and Stride Issues
Vector length
Vector stride
9.5 Compiler Effectiveness in Vector Processors
9.6 Summary
9.7 Glossary
9.8 Terminal Questions
9.9 Answers
9.1 Introduction
In the previous unit, you learnt about memory hierarchy technology and related
aspects such as cache addressing modes, mapping, elements of cache
design, cache performance, shared & interleaved memory organisation,
bandwidth & fault tolerance, and consistency models. In this unit, we will
introduce you to vector processors.
A processor design that can perform mathematical operations on multiple
data elements at the same time is called a vector processor. This is in
contrast to a scalar processor, which handles just a single element at a time.
A vector processor is also called an array processor. Vector processing was
first successfully implemented in the CDC STAR-100 and the Texas
Instruments Advanced Scientific Computer (ASC). The vector technique was
first fully exploited in the famous Cray-1. The Cray design had eight vector
registers, each holding sixty-four 64-bit words. The Cray-1 usually delivered
about 80 MFLOPS (million floating-point operations per second), but with up
to three chains running it could peak at 240 MFLOPS.
In this unit, you will study these processors: their types, uses and
effectiveness. You will also study vector length and stride issues, and
compiler effectiveness in vector processors.
Objectives:
After studying this unit, you should be able to:
• state the use and effectiveness of vector processors
• identify the types of vector processing
• describe memory-memory vector architecture
• discuss the use of CDC Cyber 200 model 205 computer
• explain vector register architecture
• recognise the functional units of vector processor
• discuss vector instructions and vector processor implementation (CRAY-1)
• solve vector length and stride issues
• explain compiler effectiveness in vector processors
vector processor. As the elements of the vector must be fetched from memory
rather than from a register, it takes somewhat longer to start a vector
operation, partly because of the cost of a memory access. An instance of a
memory-memory vector processor is the CDC Cyber 205.
Because of the ability to overlap memory accesses and the likely reuse of
vector results, vector-register processors are normally more productive and
efficient than memory-memory vector processors. However, as the length of
the vectors in a computation rises, this difference in effectiveness between
the two kinds of architectures shrinks. In fact, memory-memory vector
processors can prove more efficient for very long vectors. Experience shows,
however, that shorter vectors are more commonly used.
Based on the concepts pioneered for the CDC STAR-100, the first commercial
model of the CDC Cyber 205 was delivered in 1981. This supercomputer is a
memory-memory vector machine: it fetches vectors directly from memory to
load the pipelines and stores the pipeline results directly back to memory, and
it contains no vector registers. Consequently, the vector pipelines have large
start-up times. Instead of pipelines designed for specific operations, the
machine provides up to four general-purpose pipelines. It also provides gather
and scatter functions. The ETA-10 is an updated shared-memory
multiprocessor version of the CDC Cyber 205. The next section provides more
detail on this model.
CDC Cyber 200 model 205 computer overview: The Model 205 computer is
a super-scale, high-speed, logical and arithmetic computing system. It utilises
LSI circuits in both the scalar and vector processors that improve performance
to complement the many advanced features that were implemented in the
STAR-100 and CYBER 203 (these are the two Control Data Corp. computers
with built-in vector processors), like hardware macroinstructions, virtual
addressing and stream processing. The Model 205 contains separate scalar
and vector processors particularly designed for sequential and parallel
operations on single bit, 8-bit bytes, and 32-bit or 64-bit floating-point operands
and vector elements.
The central memory of the Model 205 is a high-performance semiconductor
memory with single-error correction, double-error detection (SECDED) on
each 32-bit half word, providing extremely high storage integrity. Virtual
input/output ports.
9.3.2 Vector register architecture
In a vector-register processor, all vector operations except load and store
take place among the vector registers. Such architectures are the vector
equivalent of a load-store architecture. Since the late 1980s, all major vector
computers have used a vector-register architecture, including the Cray
Research processors (Cray-1, Cray-2, X-MP, Y-MP, C90, T90 and SV1), the
Japanese supercomputers (NEC SX/2 through SX/5, Fujitsu VP200 through
VPP5000, and the Hitachi S820 and S-8300), and the mini-supercomputers
(Convex C-1 through C-4).
In a memory-memory vector processor, all vector operations are memory to
memory; the earliest vector computers and CDC's vector computers were of
this kind. Vector-register architectures possess various benefits over vector
memory-memory architectures. A memory-memory architecture must write all
intermediate results to memory and later read them back from memory. A
vector-register architecture can keep intermediate results in the vector
registers, close to the vector functional units, decreasing temporary storage
needs, inter-instruction latency and memory bandwidth requirements.
If a vector result is needed by several other vector instructions, a
memory-memory architecture must read it from memory repeatedly, while a
vector-register machine can reuse the value from the vector registers, further
decreasing memory bandwidth needs. For these reasons, vector-register
machines have proved more effective in practice.
Components of a vector register processor: The major components of the
vector unit of a vector register machine are as given below:
1. Vector registers: There are many vector registers that can perform
different vector operations in an overlapped manner. Every vector register
is a fixed-length bank that consists of one vector with multiple elements
and each element is 64-bit in length. There are also many read and write
ports. A pair of crossbars connects these ports to the inputs/ outputs of
functional unit.
2. Scalar registers: The scalar registers are also linked to the functional
units with the help of the pair of crossbars. They are used for various
purposes such as computing addresses for passing to the vector
load/store unit and as buffer for input data to the vector registers.
3. Vector functional units: These units are generally floating-point units that
are completely pipelined. They are able to initiate a new operation on each
clock cycle. They comprise all operation units that are utilised by the vector
instructions.
4. Vector load and store unit: This unit can also be pipelined and perform
an overlapped but independent transfer to or from the vector registers.
5. Control unit: This unit decodes instructions and coordinates among the
functional units. It can detect both data hazards and structural hazards:
data hazards are conflicts in register accesses, while structural hazards
are conflicts for functional units.
Figure 9.1 gives you a clear picture of the above-mentioned functional units
of the vector processor.
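Before looking at concrete instructions, a toy Python model may help show how the vector register file, the VL (vector length) register and a functional unit cooperate. The register names anticipate the Cray-style example discussed shortly; everything else here is illustrative, not an actual machine description.

# Toy model of a vector-register machine: eight 64-element vector registers
# and a VL register that bounds every vector operation.
V = {f"V{i}": [0] * 64 for i in range(8)}
VL = 40  # only elements 0..VL-1 participate in vector operations

V["V3"] = list(range(64))
V["V4"] = [100] * 64

def vadd(dst, src1, src2):
    # Vector add: element-wise over the first VL elements only.
    for i in range(VL):
        V[dst][i] = V[src1][i] + V[src2][i]

vadd("V1", "V3", "V4")
print(V["V1"][:3], V["V1"][39], V["V1"][40])  # [100, 101, 102] 139 0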
Activity 1:
Find out more about a recent vector thread processor which comes in two
parts: the control processor, known as Rocket, and the vector unit, known as
Hwacha.
As we cannot write
VL 40,
we must use a two-instruction sequence to load 40 into the VL register. The
last instruction specifies floating-point addition of vectors V3 and V4; as VL is
40, just the first 40 elements are included. Table 9.1 below depicts a sample of
Cray X-MP instructions.
Table 9.1: Sample Cray X-MP Instructions
Instruction | Meaning | Description
Vi Vj+Vk | Vi = Vj+Vk (integer add) | Add corresponding elements (in the range 0 to VL-1) of vectors Vj and Vk and place the result in vector Vi
Vi Sj+Vk | Vi = Sj+Vk (integer add) | Add the scalar Sj to each element (in the range 0 to VL-1) of vector Vk and place the result in vector Vi
Vi Vj+FVk | Vi = Vj+Vk (floating-point add) | Add corresponding elements (in the range 0 to VL-1) of vectors Vj and Vk and place the floating-point result in vector Vi
Vi Sj+FVk | Vi = Sj+Vk (floating-point add) | Add the scalar Sj to each element (in the range 0 to VL-1) of vector Vk and place the floating-point result in vector Vi
Vi ,A0,Ak | vector load with stride Ak | Load elements 0 to VL-1 of vector register Vi from memory, starting at address A0 and incrementing addresses by Ak
Vi ,A0,1 | vector load with stride 1 | Load elements 0 to VL-1 of vector register Vi from memory, starting at address A0 and incrementing addresses by 1
,A0,Ak Vi | vector store with stride Ak | Store elements 0 to VL-1 of vector register Vi in memory, starting at address A0 and incrementing addresses by Ak
,A0,1 Vi | vector store with stride 1 | Store elements 0 to VL-1 of vector register Vi in memory, starting at address A0 and incrementing addresses by 1
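When a program's vector length exceeds the register length (64 on the machines described here), the loop is processed in chunks of at most 64 elements, with the vector length register set for each chunk; this is the strip-mining technique named in the answers below. A minimal Python sketch, with illustrative names:

MVL = 64  # maximum vector length (size of a vector register)

def strip_mined_add(a, b):
    """Process an arbitrary-length vector in chunks of at most MVL
    elements, setting the effective vector length (vl) for each strip."""
    result = []
    n = len(a)
    for start in range(0, n, MVL):
        vl = min(MVL, n - start)  # value loaded into the VL register
        # One vector instruction operates on elements start..start+vl-1.
        result.extend(a[start + i] + b[start + i] for i in range(vl))
    return result

print(len(strip_mined_add(list(range(150)), list(range(150)))))  # 150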
Processor | Compiler | Completely vectorised | Partially vectorised | Not vectorised
CDC CYBER 205 | VAST-2 V2.21 | 62 | 5 | 33
Convex C-series | FC5.0 | 69 | 5 | 26
Cray X-MP | CFT77 V3.0 | 69 | 3 | 28
Cray X-MP | CFT V1.15 | 50 | 1 | 49
Cray-2 | CFT2 V3.1a | 27 | 1 | 72
ETA-10 | FTN 77 V1.0 | 62 | 7 | 31
Hitachi S810/820 | FORT77/HAP V20-2B | 67 | 4 | 29
IBM 3090/VF | VS FORTRAN V2.4 | 52 | 4 | 44
NEC SX/2 | FORTRAN77/SX V.040 | 66 | 5 | 29
Figure 9.5: Result of applying Vectorising Compilers to the 100 FORTRAN Test
Kernels
The kernels were designed to test vectorisation capability, and all of them can
be vectorised by hand.
Self Assessment Questions
10. List two factors which enable a program to run successfully in vector
mode.
11. There does not exist any variation in the capability of compilers to decide
if a loop can be vectorised. (True/False)
Activity 2:
Visit your local computer vendor and get an expert opinion about vector
processors and their working.
9.6 Summary
There are several representative application areas where vector processing is
of the utmost importance. Depending upon the way the operands are fetched,
vector processors can be segregated into two groups.
• Operands are straight away streamed from the memory to the functional
units and outcomes are written back to memory at the time the vector
operation advances in this architecture.
• Operands are read into vector registers wherein they are fed to the
functional units and outcomes of operations are written to vector registers
in this architecture.
• Vector register architectures have several advantages over vector
memory-memory architectures.
• There are several major components of the vector unit of a register-register
vector machine.
• The various types of vector instructions for a register-register vector
processor are:
■ Vector-scalar Instructions
■ Vector-vector Instructions
■ Vector-memory Instructions
■ Gather and Scatter Instructions
■ Masking Instructions
■ Vector Reduction Instructions
• CRAY-1 is one of the oldest processors that implemented vector
processing.
• Two issues arise in real programs: (i) the vector length in a program is
often not exactly 64, and (ii) vectors may have non-adjacent elements
residing in memory.
• The structure of the program & capability of the compiler are two factors
that affect the success with which a program can be run in vector mode.
9.7 Glossary
• ASC: Advanced Scientific Computer
• Data hazards: the conflicts in register accesses
8. Strip mining
9. Sequential words
10. Structure of the program & capability of the compiler
11. False
Terminal Questions
1. There are various application areas of vector processors which are of
considerable importance. Refer Section 9.2.
2. Depending upon the way the operands are fetched, vector processors can
be segregated into two groups: Memory-memory vector architecture and
Vector-register architecture. Refer Section 9.3.
3. Because of the ability to overlap memory accesses and the likely reuse of
vector results, vector-register processors are normally more efficient than
memory-memory vector processors. Refer Section 9.3.
4. a. The CDC Cyber 205 is based on the concepts initiated for the CDC Star
100; the first commercial model was produced in 1981. Refer Section
9.4.
b. CRAY-1 is one of the oldest processors that implemented vector
processing. Refer Section 9.5.
c. The vector size may be less than the vector register size, and the
vector size may be larger than the vector register size. Refer Section
9.6.
d. As vectors are one-dimensional series, saving a vector in memory is
direct: vector elements are stored as sequential words in memory.
Refer Section 9.6.
5. The major components of the vector unit of a register-register vector
machine are Vector Registers, Vector Functional Units, Scalar Registers
etc. Refer Section 9.5.
6. The various types of vector instructions for a register-register vector
processor are: (Refer Section 9.5.) a. Vector-scalar Instructions
b. Vector-vector Instructions
c. Vector-memory Instructions
d. Gather and Scatter Instructions
e. Masking Instructions
f. Vector Reduction Instructions
7. As an indication of the level of vectorisation that can be achieved in
scientific programs, we can observe the vectorisation levels noted for the
Perfect Club benchmarks. Refer Section 9.7.
References:
• Hwang, K. (1993). Advanced Computer Architecture. McGraw-Hill.
• Godse, D. A. & Godse, A. P. (2010). Computer Organisation. Technical
Publications.
• Hennessy, John L., Patterson, David A. & Goldberg David (2011).
Computer Architecture: A Quantitative Approach, Morgan Kaufmann; 5th
edition.
• Sima, Dezso, Fountain, Terry J. &Kacsuk, Peter (1997). Advanced
computer architectures - a design space approach. Addison-Wesley-
Longman.
E-references:
• https://csel.cs.colorado.edu/~csci4576/VectorArch/VectorArch.html
• http://www.cs.clemson.edu/~mark/464/appG.pdf
Unit 10 SIMD Architecture
Structure:
10.1 Introduction
Objectives
10.2 Parallel Processing: An Introduction
10.3 Classification of Parallel Processing
10.4 Fine-Grained SIMD Architecture
An example: the massively parallel processor
Programming and applications
10.5 Coarse-Grained SIMD Architecture
An example: the CM-5
Programming and applications
10.6 Summary
10.7 Glossary
10.8 Terminal Questions
10.9 Answers
10.1 Introduction
In the previous unit, you studied about the use and effectiveness of vector
processors. Also, you studied the vector register architecture, vector length
and stride issues. We also learnt the concept of compiler effectiveness in
vector processors. In this unit, we will throw light on the data parallel
architecture, SIMD design space. We will also study the types of SIMD
architecture. The instruction execution in conventional computers is
sequential, so there is a time constraint involved: while one program is being
executed, other tasks must wait until the first one finishes. In parallel
processing, execution proceeds in parallel: a program can be divided into
segments, and while one segment is being processed, another can be fetched
from memory and some segments can be printed (provided they are already
processed), all at the same time. The purpose of parallel processing is to
bring down the execution time and hence speed up data processing.
Parallel processing can be established by dividing the data among different
units, each unit being processed simultaneously and the timing and
sequencing being governed by the control unit so as to get the fruitful result in
minimum amount of time.
Objectives:
After studying this unit, you should be able to:
• discuss the concept of data parallel architecture
• describe the SIMD design space
• identify the types of SIMD architecture
• recognise fine grained SIMD architecture
• explain coarse grained SIMD architecture
made to work out a computational problem, possibly using multiple CPUs. A
problem is broken into discrete parts that can be solved concurrently. Each
part is further broken down into a series of instructions, and the instructions
from each part execute simultaneously on different CPUs, as shown in figure
10.2.
Figure 10.3: (a) Serial Processing (b) True Parallel Processing with Multiple
Processors (c) Parallel Processing Simulated by Switching one Processor
among Three Processes
Figure 10.3 (a) represents serial processing: the next process is started only
when the previous process has completed. In figure 10.3 (b) all three
processes run in one clock cycle on three processors. In figure 10.3 (c) all
three processes also appear to run together, but each process gets only 1/3 of
the actual clock cycle, with the CPU switching from one process to another;
while one process is running, all other processes must wait for their turn. So in
figure 10.3 (c), at any one clock time only one process is running and the
others are waiting, whereas in figure 10.3 (b) all three processes run at the
same clock time. In a uniprocessor system, the parallel processing shown in
figure 10.3 (c) is called virtual parallel processing.
Figure 10.4: Flynn's Classification of Computer System
• SISD: Single Instruction, Single Data
• SIMD: Single Instruction, Multiple Data
• MISD: Multiple Instruction, Single Data
• MIMD: Multiple Instruction, Multiple Data
In this chapter, our main focus will be Single Instruction Multiple Data (SIMD).
Single Instruction Multiple Data (SIMD)
The term single instruction implies that all processing units execute the same
instruction at any given clock cycle. The term multiple data implies that each
processing unit can work on a different data element. Generally, this type of
machine has one instruction dispatcher, a very large array of small-capacity
processing units and a very high-bandwidth internal network. This type is
suitable for specialised problems characterised by high regularity, for example
image processing. Figure 10.5 shows a case of SIMD processing.
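A toy Python sketch of the SIMD idea, with a single broadcast instruction applied across per-element data (illustrative only):

# SIMD in miniature: one control unit broadcasts the same instruction,
# and each processing element applies it to its own data element.
def simd_execute(instruction, data_elements):
    return [instruction(x) for x in data_elements]  # conceptually in lockstep

# Image-processing style example: brighten every pixel by 16, saturating.
pixels = [0, 100, 250, 37]
brightened = simd_execute(lambda p: min(p + 16, 255), pixels)
print(brightened)  # [16, 116, 255, 53]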
Activity 1:
Explore the components of a parallel architecture that are used by an
organisation. Also, find out the type of memory used in that architecture.
10.4 Fine-Grained SIMD Architecture
The Steven Unger design scheme is the initial basis for fine-grained SIMD
architectures. These are generally designed for low-level image processing.
If a fault occurs during computation, the sequence of instructions following the
last dump to external memory must be repeated after replacement of the
column containing the fault.
The processing elements are linked by a two-dimensional near-neighbour
mesh. This arrangement offers a number of important advantages over other
likely alternatives, such as easy maintenance of data structures during
shifting, engineering simplicity, high bandwidth, and a close conceptual match
to the formulation of many image-processing calculations.
The principal disadvantage of this arrangement is the slow transmission of
data between remote processors in the array. However, this matters only
when a comparatively small amount of data (rather than whole images) is to
be transmitted.
The choice of 4-connectedness rather than 8-connectedness is perhaps
surprising, in view of the minimal increase in complexity the latter involves,
compared to a twofold improvement in performance on some operations.
There is one special-purpose staging memory meant for conversion of data
format. All massively parallel computers have problems related to data input
and output, and in those parallel computers built from single-bit processors
the problems are compounded. The difficulty is that external source data is
usually formatted as a single string of integers, so if such data is fed to a
two-dimensional array in any simple manner, a considerable amount of time is
wasted before useful processing can start, essentially because the data
format does not match the machine.
The MPP included two solutions to this problem. The first was a separate data
input/output register. The second was the staging memory, which allowed
conversion between bit-plane and integer-string formats. Used jointly, these
two solutions allowed the processor array to operate continuously and so
deliver its maximum output.
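The staging memory's format conversion can be pictured as a transposition: a string of integers becomes a set of bit planes, one plane per bit position, so that an array of single-bit PEs can consume one plane at a time. A small Python sketch (names and sizes illustrative):

def to_bit_planes(values, bits=8):
    """Convert a string of integers into bit planes: plane b holds bit b of
    every value, which is the layout a single-bit PE array consumes."""
    return [[(v >> b) & 1 for v in values] for b in range(bits)]

values = [5, 3, 255, 0]
planes = to_bit_planes(values)
print(planes[0])  # least-significant-bit plane: [1, 1, 1, 0]
print(planes[7])  # most-significant-bit plane:  [0, 0, 1, 0]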
10.4.2 Programming and applications
The MPP system was commissioned by NASA principally for the analysis of
Landsat images (satellite imagery of Earth). This meant that, initially, most
applications on the system were in the area of image processing, although the
machine eventually proved to be of wider applicability. NASA also used the
MPP system for various other applications, listed below (see figure 10.9).
functions (a standard image processing technique) but the third arises due to
the different viewing angles. The MPP algorithm operates as follows:
• For each pixel in one of the images (the reference image) a local
neighbourhood area is defined. This is correlated with the similar area
surrounding each of the candidate match pixels in the second image.
• The measure applied is the normalised mean and variance cross
correlation function. The candidate yielding the highest correlation is
considered to be the best match, and the locations of the pixels in the two
images are compared to produce the disparity value at that point of the
reference image.
• The algorithm is iterative. It begins at low resolution, that is, with large
areas of correlation around each of a few pixels. When the first pass is
complete, the test image is geometrically warped according to the disparity
map.
• The process is then repeated at a higher resolution (usually reducing the
correlation area, and increasing the number of computed matches, by a
factor of two); a new disparity map is calculated and a new warping
applied, and so on.
• The procedure is continued either for a predetermined number of passes
or until some quality criterion is exceeded.
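To make the correlation step concrete, the following is a minimal sketch in C of the normalised cross-correlation measure described above, assuming 8-bit greyscale images stored row-major. The function name ncc, the patch width w and the stride parameter are illustrative, not taken from the MPP software.

#include <math.h>
#include <stddef.h>

/* Normalised cross-correlation between a w x w neighbourhood of the
   reference image and one candidate neighbourhood of the test image.
   ref and test point at the top-left corner of each patch; stride is
   the image width in pixels. */
double ncc(const unsigned char *ref, const unsigned char *test,
           size_t stride, size_t w) {
    double sum_r = 0, sum_t = 0;
    size_t n = w * w;
    for (size_t y = 0; y < w; y++)
        for (size_t x = 0; x < w; x++) {
            sum_r += ref[y * stride + x];
            sum_t += test[y * stride + x];
        }
    double mean_r = sum_r / n, mean_t = sum_t / n;

    double num = 0, var_r = 0, var_t = 0;
    for (size_t y = 0; y < w; y++)
        for (size_t x = 0; x < w; x++) {
            double dr = ref[y * stride + x] - mean_r;
            double dt = test[y * stride + x] - mean_t;
            num   += dr * dt;   /* cross term                  */
            var_r += dr * dr;   /* variance of reference patch */
            var_t += dt * dt;   /* variance of candidate patch */
        }
    return num / (sqrt(var_r * var_t) + 1e-12);  /* guard divide-by-zero */
}

The candidate pixel whose neighbourhood yields the highest value of this function is taken as the best match.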
Self Assessment Questions
9. MPP is the acronym for ___________ .
10. All highly parallel computers have problems concerned with ___________ .
10.5 Coarse-Grained SIMD Architecture
There are several technical difficulties in fully realising the fine-grained
SIMD ideal of one processor per data element. Thus, it can be better to begin
with a coarse-grained approach and develop a more practical architecture from
it. A number of parallel computer manufacturers, including nCUBE and
Thinking Machines Inc., have adopted this outlook.
Coarse-grained data-parallel architectures are often developed by
manufacturers who are more familiar with the mainstream of computer design
than with the application-specific architecture field. MIMD programmes have
helped uncover the complexities of this approach and seek to mitigate them.
The consequence of these roots is systems which can employ a number of
different paradigms, including MIMD, multiple-SIMD, and what is often called
single program multiple data (SPMD), in which each processor executes its
own program, but all the programs are the same and so remain in lock-step.
Such systems are frequently used in this data-parallel mode, and it is
therefore reasonable to include them within the SIMD paradigm. Naturally,
when they are used in a different mode, their operation has to be analysed in
a different way. Coarse-grained SIMD systems of this type embody the
following concepts:
• Each PE is of high complexity, comparable to that of a typical
microprocessor.
• The PE is usually constructed from commercial devices rather than
incorporating a custom circuit.
• There is a (relatively) small number of PEs, on the order of a few
hundreds or thousands.
• Every PE is provided with ample local memory.
• The interconnection method is likely to be one of lower diameter and lower
bandwidth than the simple two-dimensional mesh. Networks such as the
tree, the crossbar switch and the hypercube can be utilised.
• Provision is often made for huge amounts of relatively high-speed, high-
bandwidth backup storage, often using an array of hard disks.
• The programming model assumes that some form of data mapping and
remapping will be necessary, whatever the application.
• The application field is likely to be high-speed scientific computing.
Systems of this type have a number of advantages over fine-grained SIMD,
such as the capability to take maximum advantage of the latest processor
technology, the ability to perform high-precision computations with no
performance penalty, and the easier mapping to a selection of different data
types which the smaller number of processors and improved connectivity
permits.
The software required for such systems offers an advantage and a
disadvantage at the same time. The advantage lies in its closer similarity to
normal programming; the disadvantage lies in the less natural programming
for some applications. Coarse-grained systems also show greater variety in
their designs, because each component of the design is less constrained than
in a fine-grained system. The example given below is, therefore, less
specifically representative of its class than was the MPP machine considered
earlier.
There are three aspects of the system design which are of major importance.
The first is the data interconnection network, shown in figure 10.11, which is
designated by the designers as a fat-tree network. It is based upon the
quadtree, augmented to reduce the likelihood of blocking within the network.
Thus, at the lowest level, within what is designated an interior node of four
processors, there are at least two independent direct routes between any pair
of processors. Utilising the next level of tree, there are at least four partly
independent routes between a pair of processors. This increase in the number
of potential routes is maintained for increasing numbers of processors by
utilising higher levels of the tree structure.
Although this structure provides a potentially much higher bandwidth than the
ordinary quadtree, like any complex system, achieving the highest
performance depends critically on effective management of the resource. The
second component of system design which is of major importance is the
method of control of the processing elements. Since each of these
incorporates a complete microprocessor, the system can be used in fully
asynchronous MIMD mode. Similarly, if all processors execute the same
program, the system can operate in the SPMD mode. In fact, the designers
suggest that an intermediate method is possible, in which processing elements
act independently for part of the time, but are frequently resynchronised
globally. This technique corresponds to the implementation of SIMD with
algorithmic processor autonomy.
The final system design aspect of major importance is the data I/O method.
The design of the CM5 system seeks to overcome the problem of improving
(and therefore variable) disk access speeds by allowing any desired number
of system nodes to be allocated as disk nodes with attached backup storage.
This permits the amount and bandwidth of I/O arrangements to be tailored to
the specific system requirements. Overall, one of the main design aims, which
was pursued for the CM5 system, was scalability. This means not only that
the number of nodes in the system can vary between (in the limit) one and
16384 processors, but also that system parameters such as peak computing
rate, memory bandwidth and I/O bandwidth all automatically increase in
proportion to the number of processing elements. This is shown in Table 10.2.
Table 10.2: CM5 System Parameters

Number of processors          32      1024    16384
Number of data paths          128     4096    65536
Peak speed (GFLOPS)           4       128     2048
Memory (Gbyte)                1       32      512
Memory bandwidth (Gbyte/s)    16      512     8192
I/O bandwidth (Gbyte/s)       0.32    10      160
Synchronisation time (µs)     1.5     3.0     4.0
Activity 2
Visit an organisation and find out the difficulties that are faced by the
computer designers in implementing and operating the fine-grained and
coarse-grained SIMD architectures.
10.6 Summary
Let us recapitulate the important concepts discussed in this unit:
10.9 Answers
Self Assessment Questions
1. Instructions
2. Parallel computer
3. True
4. Simulated or virtual
5. Single-instruction, multiple-data
6. SISD, SIMD, MISD, and MIMD
7. Vector processing
8. Instruction-level parallelism
9. Massively Parallel Processor
10. Input and output of data
11. nCUBE and Thinking Machines Inc.
12. Single program multiple data
13. CM1
14. CM5
Terminal Questions
1. Parallel computing is the simultaneous use of multiple compute
resources to solve a computational problem. Refer Section 10.2.
2. The core element of parallel processing is the CPU. The essential
computing process is the execution of a sequence of instructions on a set
of data. Refer Section 10.3.
3. The Stephen Unger design scheme is the initial basis for the fine-grained
SIMD architectures. Refer Section 10.4.
11.8 Glossary
11.9 Terminal Questions
11.10 Answers
11.11 Introduction
In the previous unit, you were introduced to data parallel architecture in which
you studied the SIMD part. You learned about SIMD architecture and its
various aspects like the SIMD design space, fine-grained SIMD architecture
and coarse-grained SIMD architecture. In this unit we will progress a step
further to explain the MIMD architecture. Although we covered vector
architecture in a prior unit, we will throw some light on it here as well so that
the concept of MIMD can be understood better.
According to the famous computer architect Jim Smith, the most efficient way
to execute a vectorisable application is a vector processor. Vector
architectures collect groups of data elements distributed in memory, place
them in linear sequential register files, operate on the data present in those
register files, and disperse the results back to memory. MIMD architectures,
on the other hand, are of great importance and may be used in numerous
application areas such as CAD (Computer Aided Design), CAM (Computer
Aided Manufacturing), modelling, simulation, etc.
In this unit, we are going to study different features of vector architecture and
MIMD architecture, such as pipelining, MIMD architectural concepts, problems
of scalable computers, and the main design issues of scalable MIMD
architecture.
Objectives:
After studying this unit, you should be able to:
• recall the concept of vector architecture
• discuss the concept of pipelining
• describe MIMD Architectural concepts
• differentiate between multiprocessor and multicomputer
• interpret the problems of scalable computers
• solve the problems of scalable computers
• recognise the main design issues of scalable MIMD architecture
11.2 Vectorisation
Vector machines are planned and designed to operate at the level of vectors.
Now you will study the operation of vectors. Suppose there are two vectors, A
and B, each having 64 components. The number of components present in a
vector is the vector size, so our vector size is 64. Vectors A and B are shown
below:
A = (a0, a1, ..., a63)
B = (b0, b1, ..., b63)
Now we want to add these two vectors and keep the result in another vector
C, as shown in the equation below. The rule for adding vectors is to add the
corresponding components:

C = A + B, where ci = ai + bi for i = 0, 1, ..., 63

A vector machine provides a single vector instruction which performs

Vd = Vs1 VOP Vs2

Here, VOP is the vector operation which is performed on vector registers Vs1
and Vs2, and the result is stored in Vd.
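As a rough illustration, the scalar loop below in C is what such a single vector instruction replaces; the names VLEN and vector_add are illustrative, not from any particular machine.

#include <stddef.h>

#define VLEN 64   /* vector size, as in the A and B example above */

/* On a scalar processor, C = A + B requires a loop that issues 64
   separate add instructions. A vector machine expresses the whole
   loop as one instruction, e.g. something like ADDV V3, V1, V2. */
void vector_add(const double *a, const double *b, double *c) {
    for (size_t i = 0; i < VLEN; i++)
        c[i] = a[i] + b[i];
}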
Architecture
As discussed in Unit 9, both a scalar and a vector unit are present in a vector
machine. The scalar unit has the same structural design as a conventional
processor and works on scalars; similarly, the vector unit works on vector
operations. Advancements such as moving from CISC to RISC designs, and
from memory-memory architecture to vector-register architecture, have been
seen in vector architecture.
Vector Registers: Vector registers carry the input and result vectors. Eight
vector registers are present in the Cray 1 and many other vector processors,
each containing 64 elements of 64 bits. Some processors make this
arrangement configurable: the Fujitsu VP200, for example, provides a
programmable set of 8K elements that can be organised as anywhere from 8
to 256 vector registers, so that 8 registers carry 1024 elements each, while
256 registers carry 32 elements each.
As figure 11.1 shows, each vector register has one write port and two read
ports so that vector operations can overlap on different vector registers.
Scalar Registers: Vector operations take their scalar inputs from the scalar
registers; for example, a scalar register holds the constant when every
element of a vector is multiplied by it.

B = 5*X + Y

In the above equation, 5 is a constant stored in a scalar register, and X and Y
are vectors held in two different vector registers. Address calculations for the
vector load/store unit are also done in these registers. Vector Load/Store
Unit: Data moves between memory and the vector registers through this unit.
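A minimal sketch in C of the B = 5*X + Y computation above, showing how one scalar operand combines with two vector operands; the names are illustrative.

#include <stddef.h>

#define VLEN 64

/* One scalar input (held in a scalar register on a vector machine)
   is applied to every element of the vectors X and Y. */
void scalar_vector_op(double scalar, const double *x, const double *y,
                      double *b) {
    for (size_t i = 0; i < VLEN; i++)
        b[i] = scalar * x[i] + y[i];
}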
11.3 Pipelining
We have discussed this concept in Unit 4 and 5, but we need to recap it in
order to get a better idea of the next sections.
What is Pipelining?
An implementation technique by which the execution of multiple instructions
can be overlapped is called pipelining. In other words, it is a method which
breaks a sequential process down into numerous sub-operations, with every
sub-operation executed concurrently in its own dedicated segment. The main
advantage of pipelining is that it increases instruction throughput, which is
defined as the number of instructions completed per unit time; thus, a
program runs faster. In pipelining, several computations can run in distinct
segments simultaneously.
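As a simple worked example (with illustrative numbers, not drawn from any particular machine), suppose a pipeline has k = 4 segments and must process n = 100 tasks, each segment taking one clock cycle. Without pipelining the work takes n x k = 400 cycles. With pipelining, the first task emerges after k = 4 cycles and one further task completes every cycle thereafter, for a total of k + (n - 1) = 103 cycles, a speedup of about 3.9, which approaches k as n grows.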
A register is connected to every segment in the pipeline to provide isolation
between segments, so that each segment can operate on distinct data
simultaneously.
11.4.1 Multiprocessor
Multiprocessors are systems with multiple CPUs, which are capable of
independently executing different tasks in parallel. They have the following
main features:
• They have either shared common memory or unshared distributed
memories.
• They also share resources, for example I/O devices, system utilities,
program libraries, and databases.
• They run under an integrated operating system that provides interaction
among processors and their programs at the job, task, file, and data
element levels.
Types of multiprocessors
There are three types of multiprocessors, distinguished by the way in which
shared memory is implemented (see figure 11.3). They are:
• UMA (Uniform Memory Access),
• NUMA (Non-Uniform Memory Access), and
• COMA (Cache Only Memory Access).
Basically, the memory is divided into several modules, and the way these
modules are accessed is what divides large multiprocessors into the different
categories. Let's discuss them in detail.
UMA (Uniform Memory Access): In this category, every processor has the
same access time to every memory module; hence each memory word can
be read as quickly as any other memory word. Where this is not physically the
case, fast references are slowed down to match the slow ones so that
programmers cannot observe a difference; this is what uniformity means here.
Uniformity makes performance predictable, which is a significant aspect of
code writing. Figure 11.4 shows uniform memory access from the CPU on the
left.
Modern UMA machines are small, single-bus multiprocessors. In the early
designs of scalable shared memory systems, large UMA machines with a
switching network and hundreds of processors were common. Well-known
examples of those multiprocessors are the NYU Ultracomputer and the
Denelcor HEP. Their designs introduced numerous features which became
important achievements in today's parallel computer architecture.
Nevertheless, these early systems had no local main memory or cache
memory, which has since proved essential for attaining high performance in
scalable shared memory systems. The UMA approach is not appropriate for
building scalable parallel computers, but it is very good for constructing small
single-bus multiprocessors, such as the Encore Multimax of Encore
Computer Corporation, introduced in the late 80s, and the machines of Silicon
Graphics Computing Systems, introduced in the late 90s.
NUMA (Non-Uniform Memory Access): These machines are intended to
avoid the memory access bottleneck of UMA machines. The logically shared
memory is physically distributed among the processing nodes of NUMA
machines, giving rise to distributed shared memory architectures. Figure 11.4
shows non-uniform memory access between the left and right disks. Although
these parallel computers are highly scalable, they are extremely sensitive to
data allocation in the local memories: accessing a remote memory segment
of another node is much slower than accessing a local memory segment. The
architecture of these machines is similar to that of distributed-memory
multi-computers; the major difference lies in the organisation of the address
space. In multiprocessors, a global address space that is uniformly visible
from each processor is applied; in other words, every CPU can transparently
access all memory locations. In multi-computers, by contrast, the address
space is replicated in the local memories of the processing elements. This
difference in the address space of the memory is also reflected at the
software level: NUMA machines are programmed on the global address
space (shared memory) principle, while distributed-memory multi-computers
are programmed on the message-passing paradigm.
COMA (Cache Only Memory Access): A COMA machine also provides
non-uniform access, but in a different way. It avoids the effects of the static
memory allocation of NUMA and Cache-Coherent Non-Uniform Memory
Architecture (CC-NUMA) machines. This is done by two activities:
Activity 1:
Prepare a collage depicting two columns, one for multiprocessor while the
other for multi-computer and paste diagrams, notes, pictures etc of the
various machines found under each heading.
The pointers to A and B are rA and rB, stored in the local memory of P0.
Accesses to A and B are realised by the "rload rA" and "rload rB" instructions,
which must travel through the interconnection network in order to fetch A and
B. The situation is even worse if the values addressed by rA and rB are not
yet available in M1 and Mn, because they will be generated by some other
process to be executed later. In this case, where idling occurs due to
synchronisation among parallel processes, the original process on P0 must
wait an unpredictable time, resulting in unpredictable latency.
Solutions to the problems
In order to solve the above-mentioned problems several possible
hardware/software solutions were proposed and applied in various parallel
computers. They are as follows:
1. Application of cache memory
2. Pre-fetching
3. Multithreading: the processor issues an instruction from thread 1, then an
instruction from thread 2, and so on. In this way the processor can be kept
busy even during lengthy latencies, as long as independent threads are
available. Indeed, to reduce latencies, some systems automatically switch
between processes after each instruction.
4. Non-blocking writes: The last method for reducing or hiding latency is
non-blocking writes. In this method the memory operation starts but the
program continues executing, whereas normally, when a STORE
instruction is carried out, the CPU waits until the instruction completes
before continuing.
Activity 2:
Visit a library and read books on computer architecture to find out more
ways of resolving the problems of scalable computers.
11.8 Glossary
• CC-NUMA machine: Cache-Coherent Non-Uniform Memory Access
machine.
• COMA machine: Cache Only Memory Access machine.
• COW: Cluster of Workstations.
• DDM: Data Diffusion Machine.
• LMD: Load Memory Data.
• MPPs: Massively Parallel Processors.
• Multi-computer: It contains numerous von Neumann computers that are
connected by an interconnection network.
• Multiprocessor: Systems with multiple CPUs, which are capable of
independently executing different tasks in parallel.
• NORMA: NO Remote Memory Access.
• NOW: Network of Workstations.
• NUMA machine: Non Uniform Memory Access machine.
• Register: It is associated with each segment in the pipeline to provide
isolation between each segment.
• UMA machine: Uniform Memory Access machine.
11.10 Answers
Self Assessment Questions
1. CDC Star 100
2. Vector
3. Instruction throughput
4. Virtual parallelism
5. True
6. MIMD
7. False
8. Non Uniform Memory Access
9. Remote loads
10. False
11. Interconnection network design
12. Physically shared memory
Terminal Questions
1. A vector may refer to a type of one dimensional array. Vectorisation is
collecting the group of data elements distributed in memory and after that
placing them in linear sequential register files. Refer Section 11.2.
2. The five components are vector registers, scalar registers, vector
functional units, vector load/store unit, and main memory. Refer Section
11.2.
References:
• Hwang, K. Advanced Computer Architecture. McGraw-Hill, 1993.
• Godse, D. A. & Godse, A. P. Computer Organization. Technical
Publications.
• Hennessy, John L., Patterson, David A. & Goldberg David, Computer
Architecture: A Quantitative Approach, Morgan Kaufmann; 5th edition,
2011.
• Sima, Dezso, Fountain, Terry J. & Kacsuk, Peter, Advanced Computer
Architectures: A Design Space Approach. Addison-Wesley-Longman.
E-references:
• http://www.cs.umd.edu/class/fall2001/cmsc411/projects/MIMD/mimd.
html.
• http://www.docstoc.com/docs/2685241/Computer-Architecture-
Introduction-to-MIMD-architectures.
12.1 Introduction
In the previous unit, you studied the concepts of vectorisation and pipelining.
You also studied MIMD architectural concepts and the problems of scalable
computers, and we learnt the main design issues of scalable MIMD
architecture.
A computer must have a system to get information from the outside world and
must be able to communicate results to the external world. Programs and
data must be entered into computer memory in order to be processed, and
the results of calculations must be recorded or displayed for the user. To use
a computer efficiently, it is necessary to prepare numerous programs and
data beforehand. These programs and data are transferred to a storage
medium, and the information on disk is then transmitted into the computer's
memory at high speed.
surface. Within the drive, a disk spins rapidly past the read/write head, as
shown in figure 12.1. A floppy disk, for example, may spin past the read/write
head at 300 revolutions per minute (RPM), whereas a hard drive may spin at
3,600 to 10,000 RPM.
hard drive and a floppy-disk drive that lets the user insert and remove disks.
Normally, a PC's hard drive resides within the PC's system unit.
Tape drive: The function of a tape drive is to read and write data on a tape
surface. An audio cassette deck functions in much the same way; the only
difference is that a tape drive records digital data. Tape storage generally
holds data which is not needed frequently, such as backup copies of the hard
disk. A tape drive must write data serially, because a tape is a long strip of
magnetic material. The direct access offered by media like disks is therefore
faster than a tape drive, which reads and writes data serially.
When particular data on a tape is required, the drive has to scan through all
the preceding data, including data which is not needed, and this results in
slow access times. The access time varies with the speed of the drive, the
position of the data on the tape, and the length of the tape.
12.3.2 Optical storage
The most extensively used kind of optical storage is the CD (compact disk).
Compact disks are utilised in CD-R, CD-RW, CD-ROM, DVD-ROM, and
Photo CD systems. Nowadays, systems with DVD-ROM drives are preferred
over standard CD-ROM units. Optical storage devices store data on a
reflective surface, and a beam of laser light is used to read the data back.
The thin beam of light is directed and focused by means of lenses, mirrors,
and prisms; because all of the light has the same wavelength, the laser beam
can be focused very precisely.
CD-ROM: This stands for compact disk read-only memory. To read data from
a CD-ROM, a laser beam is directed at the surface of the spinning disk. The
areas that reflect the light back are read as 1s, and the areas that scatter the
light and do not reflect it back are read as 0s. This is shown in figure 12.3.
Data on the disk is stored in one long spiral track that begins near the centre
of the disk and ends at the outer edge.
A general-purpose computer makes use of peripherals such as printers and
magnetic disks; magnetic tape is used for backup storage. All peripheral
devices are connected to the processor by means of interface units.
Each interface decodes the address and control information received from
the I/O bus, interprets them for its peripheral, and supplies signals for the
peripheral controller. It synchronises the data flow and administers the
transfer between processor and peripheral. Every peripheral has its own
controller that operates a particular electromechanical device; for instance, a
printer controller controls paper movement, print timing, and the selection of
printing characters. A controller may be housed separately or physically
integrated with the peripheral. The I/O bus from the processor is connected to
every peripheral interface.
To communicate with a device, the processor places the address of the
device on the address lines. Each interface attached to the I/O bus contains
an address decoder that monitors the address lines. When an interface
recognises its own address, it activates the path between the bus lines and
the device it controls, while peripherals whose address does not match the
address on the bus are disabled by their interfaces. At the same time, the
processor provides a function code on the control lines.
The selected interface responds to the function code and proceeds to
execute it. The function code can be regarded as an input-output command:
an instruction that is executed in the interface and its attached peripheral unit.
An interface may receive different kinds of commands. They are:
• Control: A control command is given to activate the peripheral. The
particular control command given depends on the peripheral; every
peripheral has its own differentiated series of commands, according to
its mode of operation.
• Status: This command is used for testing various status conditions in the
peripheral as well as the interface. For instance, before initiating a
transfer, the computer may want to verify the peripheral's status, and
any errors that occur while the transfer is going on are observed by the
interface.
• Data output: On receiving this command, the interface responds by
transferring data from the bus into one of its registers. As an example,
consider a tape unit. By means of a control command, the computer
starts the tape moving; the processor then monitors the status of the
tape by using a status command.
• Data input: On receiving this command, the interface obtains a data item
from the peripheral and places it in its buffer register. The processor
checks the availability of the data by using a status command and then
issues the data input command; the interface puts the data on the data
lines, where it is accepted by the processor.
12.4.1 Input-Output vs. memory bus
In addition to communicating with I/O devices, the processor must also
communicate with the memory unit. The memory bus consists of the
following:
• data
• address
• read/write control lines
Computer buses can communicate with I/O and memory by using the following
techniques:
• Make use of two different buses, the first bus for memory and the
second bus for I/O.
• Make use of a common bus for I/O as well as memory. However
different control lines should be there for each.
• Make use of a common bus for I/O as well as memory having common
control lines.
In the case of the first technique, the computer has separate sets of data,
address, and control buses, one set for accessing memory and the other for
I/O. This is done in computers that have an individual input-output processor
(IOP) in addition to the CPU (Central Processing Unit). The memory
communicates with both the CPU and the IOP through the memory bus, while
the IOP communicates with the input and output devices through a separate
I/O bus with its own data, address, and control lines.
CS  RS1  RS0   Register selected
1   0    0     Port A register
1   0    1     Port B register
1   1    0     Control register
1   1    1     Status register
The chip select and register select inputs determine the address assigned to
the interface. I/O read and I/O write are two control lines that specify an input
or an output transfer, respectively. The four registers (Port A register, Port B
register, control register and status register) communicate directly with the
I/O device attached to the interface. Input-output data to and from the device
can be transferred into either port A or port B.
The interface may operate with an output device or with an input device, or
with a device that requires both input and output. If the interface is connected
to a printer, it will only output data, and if it services a character reader, it will
only input data. A magnetic disk unit is used to transfer data in both directions
but not at the same time, so the interface can use bidirectional lines. A
command is passed to the I/O device by sending a word to the appropriate
interface register.
In a system like this, the function code in the I/O bus is not needed because
control is sent to the control register, status information is received from the
status register, and data are transferred to and from ports A and B registers.
Thus the transfer of data, control, and status information is always via the
common data bus.
The distinction between data, control, or status information is determined from
the particular interface register with which the CPU communicates. The control
register gets control information from the CPU. By loading appropriate bits into
the control register, the interface and the I/O device attached to it can be
placed in a variety of operating modes. For example, port A may be defined
as an input port and port B as an output port. A magnetic tape unit may be
instructed to rewind the tape or to start the tape moving in the forward
direction. The bits in the status register are used for status conditions and for
recording errors that may occur during the data transfer. For example, a
status bit may indicate that port A has received a new data item from the I/O
device.
The interface registers use a bidirectional data bus to communicate with the
CPU. The address bus selects the interface unit through the chip select and
the two register select inputs. A circuit (usually a decoder) must be provided
externally to detect the address assigned to the interface registers; this circuit
enables the chip select (CS) input when the interface is addressed. The two
register select inputs RS1 and RS0 are usually connected to the two least
significant lines of the address bus; these two inputs select one of the four
registers in the interface, as specified in the table above. The content of the
selected register is transferred to the CPU via the data bus when the I/O read
signal is enabled, and the CPU transfers binary information into the selected
register via the data bus when the I/O write input is enabled.
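A minimal sketch in C of how software might access such a four-register interface when it is memory-mapped; the base address 0x8000, the struct layout and the ready bit are illustrative assumptions, not details of any specific device.

#include <stdint.h>

/* The four interface registers, selected by RS1/RS0 as in the table
   above. The base address and layout are illustrative only. */
typedef struct {
    volatile uint8_t port_a;   /* RS1 RS0 = 0 0 */
    volatile uint8_t port_b;   /* RS1 RS0 = 0 1 */
    volatile uint8_t control;  /* RS1 RS0 = 1 0 */
    volatile uint8_t status;   /* RS1 RS0 = 1 1 */
} io_interface_t;

#define IFACE ((io_interface_t *)0x8000u)

uint8_t read_from_device(void) {
    while ((IFACE->status & 0x01) == 0)  /* poll: new item in port A? */
        ;
    return IFACE->port_a;                /* data arrives via port A   */
}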
Self Assessment Questions
7. ________ is used in computers for backup storage.
8. ________ from the processor is attached to all peripheral interfaces.
9. A ____________ is issued to test various status conditions in the interface
and peripheral.
Activity 1:
Visit an IT organisation and observe the functioning of the I/O interface and
the data lines, control lines, and I/O bus architecture. Also, check whether
the I/O system used is isolated or memory-mapped.
availability of 100 disks much higher than that of a single disk. Such systems
have become known by the acronym RAID, which stands for redundant array
of inexpensive disks. We will study this topic in the next section.
Self Assessment Questions
10. ____________ refers to consistent reporting when information is lost
because of failure.
11. ____________ is an innovation that improves both availability and
performance of storage systems
12.6 RAID
RAID is the acronym for 'redundant array of inexpensive disks'. There are
several approaches to redundancy, with different overheads and
performance. The 1987 paper by Patterson, Gibson, and Katz introduced the
term RAID and used a numerical classification for these schemes that has
become popular; in fact, the non-redundant disk array is sometimes called
RAID 0. One issue in any such scheme is discovering when a disk fails.
Magnetic disks provide information about their correct operation: check
information recorded in each sector helps detect errors in that sector, and
transferring sectors allows the attached electronics to discover the failure of a
disk or the loss of information.
The levels of RAID are as follows:
12.6.1 Mirroring (RAID 1)
Mirroring or shadowing is the traditional solution to disk failure. It uses twice
as many disks: data is simultaneously written to two disks, one non-redundant
and one redundant, so that there are always two copies of the data. If one
disk fails, the system goes to the mirror disk to get the required information.
This technique is the most expensive solution.
12.6.2 Bit-Interleaved parity (RAID 3)
Bit-interleaved parity is an error detection technique in which character bit
patterns are forced into parity, so that the total number of one (1) bits is
always odd or always even. This is done by adding a "1" or "0" bit to each
byte as the character/byte is transmitted; at the other end of the transmission
the parity is checked for accuracy. Bit-interleaved parity is also used at the
physical layer (high-speed transmission of binary data) to monitor errors.
The cost of higher availability can be reduced to 1/N, where N is the number
of disks in a protection group. RAID 4 supports mixtures of large reads, large
writes, small reads and small writes. One drawback of the scheme is that the
parity disk must be updated on every write, so it is the bottleneck for
sequential writes.
To fix the parity-write bottleneck, the parity information is spread throughout
all the disks so that there is no single bottleneck for writes. This distributed
parity organisation is RAID 5. Figure 12.6 shows how data is distributed in
RAID 4 and RAID 5.
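The parity used by RAID 4 and RAID 5 is a bytewise XOR across the data blocks of a stripe. The following is a minimal sketch in C (the block size, function names and layout are illustrative): the parity block is the XOR of all data blocks, and a single lost block can be rebuilt by XORing the surviving blocks with the parity.

#include <stddef.h>

#define BLOCK_SIZE 512

/* Parity block = bytewise XOR of all data blocks in the stripe. */
void compute_parity(const unsigned char blocks[][BLOCK_SIZE],
                    size_t nblocks, unsigned char parity[BLOCK_SIZE]) {
    for (size_t i = 0; i < BLOCK_SIZE; i++) {
        unsigned char p = 0;
        for (size_t b = 0; b < nblocks; b++)
            p ^= blocks[b][i];
        parity[i] = p;
    }
}

/* Rebuild block `lost` in place from the survivors and the parity. */
void rebuild_block(unsigned char blocks[][BLOCK_SIZE], size_t nblocks,
                   const unsigned char parity[BLOCK_SIZE], size_t lost) {
    for (size_t i = 0; i < BLOCK_SIZE; i++) {
        unsigned char p = parity[i];
        for (size_t b = 0; b < nblocks; b++)
            if (b != lost)
                p ^= blocks[b][i];
        blocks[lost][i] = p;
    }
}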
The two most common measures of I/O performance are diversity and
capacity.
• Diversity: Which I/O devices can connect to the computer system?
• Capacity: How many I/O devices can connect to a computer system?
Other traditional measures of performance are throughput (sometimes called
bandwidth) and response time (sometimes called latency). Figure 12.7 shows
the simple producer-server model. The producer creates tasks to be
performed and places them in a buffer; the server takes tasks from the first-
in-first-out buffer and performs them.
Response time is defined as the time from when a task is placed in the buffer
until the server completes it. Throughput, in simple words, is the average
number of tasks completed by the server over a period of time. To reach the
maximum throughput, the server should never be idle, so the buffer should
never be empty; response time, by contrast, includes the time spent waiting in
the buffer and is minimised when the buffer is empty.
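For example (with illustrative numbers), if the server completes one task every 20 ms and is never idle, the throughput is 50 tasks per second; if a task additionally waits 60 ms in the buffer before service begins, its response time is 60 + 20 = 80 ms.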
Improving performance does not always mean improving both response time
and throughput. Throughput is increased by adding more servers, as shown
in figure 12.8, since spreading data across two disks instead of one enables
tasks to be performed in parallel. Unfortunately, this does not help response
time, unless the workload is held constant and the time in the buffers is
reduced because of the added resources.
Figure 12.8: Single-Producer Model Extended with another Server and Buffer
How does the architect balance these conflicting demands? If the computer
is interacting with human beings, figure 12.9 suggests an answer.
Figure 12.9 compares three workloads: a conventional interactive workload
(1.0 sec. system response time), a high-function graphics workload (1.0 sec.
system response time), and a high-function graphics workload (0.3 sec.
system response time).
This figure presents the results of two studies of interactive environments: one
keyboard oriented and one graphical. An interaction, or transaction, with a
computer is divided into three parts:
1. Entry time - The time for the user to enter the command. The graphics
system in figure 12.9 took 0.25 seconds on average to enter a command
versus 4.0 seconds for the keyboard system.
2. System response time - The time between when the user enters the
command and the complete response is displayed.
3. Think time - The time from the reception of the response until the user
begins to enter the next command.
The sum of these three parts is called the transaction time. Several studies
report that user productivity is inversely proportional to transaction time;
transactions per hour are a measure of the work completed per hour by the
user.
The results in figure 12.9 show that reduction in response time actually
decreases transaction time by more than just the response time reduction:
Cutting system response time by 0.7 seconds saves 4.9 seconds (34%) from
the conventional transaction and 2.0 seconds (70%) from the graphics
transaction. This seemingly implausible result is explained by human nature:
people need less time to think when given a faster response.
Self Assessment Questions
14. _______ is also known as bandwidth.
15. ________________ is sometimes called latency.
Activity 2:
Visit an organisation. Find the level of reliability, availability and dependability
of the system used. Also, measure the I/O performance.
12.8 Summary
Let us recapitulate the important concepts discussed in this unit:
• A computer must have a system to get information from the outside world
and must be able to communicate results to the external world.
• Each time the PC is shut down, the contents of the PC's random-access
memory (RAM) are lost.
12.9 Glossary
• Bus Interface: Communication link between the processor and several
peripherals.
• CD-R: Compact disk recordable
• CD-ROM: Compact disk read only memory
• CD-RW: Compact disk rewritable
• DVD-ROM: Digital video (or versatile) disk read only memory
• Input devices: Computer peripherals used to enter data into the computer.
• Input-Output Interface: This gives a method for transferring information
between internal memory and I/O devices.
• Input-Output Processor (IOP): An external processor that
communicates directly with all I/O devices and has direct memory access
capabilities.
• Output devices: Computer peripherals used to get output from the
computer.
• RAM: Random Access Memory
• RPM: Revolutions per Minute
12.10 Terminal Questions
12.11 Answers
Self Assessment Questions
1. Computer architecture
2. Computer organisation
3. Random access memory
4. Storage media
5. Revolutions per minute
6. CD-ROM
7. Magnetic tape
8. I/O bus
9. Status command
10. Data integrity
11. Disk array
12. Redundant array of inexpensive disks
13. Mirroring or shadowing
14. Throughput
15. Response time
Terminal Questions
1. To keep information from one computer session to another, one must store
the information within a file that ultimately stores on disk. This is called
storage system. Refer Section 12.2.
2. There are two main categories of the storage devices: magnetic storage
and optical storage. Refer section 12.3.
3. Peripherals connected to a computer require special communication links
for interfacing them with the CPU. This is done through the I/O bus, which
connects the peripheral devices to the CPU. Refer Section 12.4.
4. Memory transfer and I/O transfer differ in that they may use separate
buses, or a common bus with separate control lines for memory and I/O.
Refer Section 12.4.
References:
• Kai Hwang, Advanced Computer Architecture: Parallelism, Scalability,
Programmability, McGraw-Hill.
• Michael J. Flynn, Computer Architecture: Pipelined and Parallel Processor
Design, Narosa.
• J. P. Hayes, Computer Architecture and Organisation, McGraw-Hill.
• Nicholas P. Carter, Schaum's Outline of Computer Architecture,
McGraw-Hill Professional.
E-references:
• www.es.ele.tue.nl
• www.stanford.edu
• ece.eng.wayne.edu
13.1 Introduction
In the previous unit, you studied about storage systems. You covered various
aspects such as types of storage devices, connecting I/O devices to
CPU/memory, reliability, availability and dependability, RAID, I/O performance
measures. Multithreading is a type of multitasking. Prior to Win32, the only
type of multitasking that existed was the cooperative multitasking, which did
not have the concept of priorities. The multithreading system has a concept of
priorities and therefore, is also called background processing or pre-emptive
multitasking.
Dataflow architecture is in direct contrast to the traditional Von Neumann
architecture or control flow architecture. Although dataflow architecture has
not been used in any commercially successful computer hardware, it is very
13.2 Multithreading
Multithreading is the capability of a processor to do multiple things at one time.
The Windows operating system uses the API (Application Programming
Interface) calls to manipulate threads in multithreaded applications. Before
discussing the concepts of multithreading, let us first understand what a thread
is.
13.2.1 What is a thread?
Each process, which occurs when an application is run, consists of at least
one thread that contains code. All the code within a thread, when it is active,
is performed consecutively, one line after another. In a multithreading system,
many threads belonging to that particular process run concurrently. A thread
can be viewed as an independent program counter within a process,
indicating the location of the instruction that the thread is currently executing. A
thread has the following features:
• A state of thread execution
Although these tasks can be performed using more than one process, it is
generally more efficient to use a single multithreaded application because the
system can perform a context switch more quickly for threads than processes.
Moreover, all threads of a process share the same address space and
resources, such as files and pipes.
13.2.3 Benefits of multithreading system
A multithreading system provides the following benefits over a multiprocessing
system:
• Threads improve communication between different execution traces,
since the same user address space is shared.
• In an existing process, creating a new thread is much less
time-consuming than creating a brand-new process.
• Termination of a thread also takes less time.
• Switching control between two threads within the same process also
takes less time than switching between two processes (a brief
illustration follows this list).
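A minimal sketch using POSIX threads to illustrate these points; the unit discusses the Win32 thread API, and pthreads is used here purely for illustration, with names of our own choosing.

#include <pthread.h>
#include <stdio.h>

/* Both threads run concurrently within one process and share its
   address space, so the string arguments in main are visible to them. */
void *worker(void *arg) {
    printf("thread %s running\n", (const char *)arg);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    /* Creating a thread is far cheaper than creating a new process. */
    pthread_create(&t1, NULL, worker, "A");
    pthread_create(&t2, NULL, worker, "B");
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}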
Self Assessment Questions
1. The multithreading system has a concept of priorities and therefore, is
also called __________ or __________ .
2. It takes much more time to create a new thread in an existing process
than to create a brand-new process. (True/False)
Activity 1:
Visit a library and find out more details about the various models of
multithreading like Blocked model, Forking model, Process-pool model and
Multiplexing model.
recent operating systems, such as Solaris, Linux, Windows 2000 and OS/2,
support multiple processes with multiple threads per process. However, the
traditional operating system, MS-DOS supports a single user process and a
single thread. Some traditional UNIX systems are multiprogramming systems
as they maintain multiple user processes but only one execution path is
allowed for each process.
Self Assessment Questions
5. Almost all business applications are _________________ .
6. When an application is run, each of the processes contains at least one
system limits the overall speed of a computer system, since the speed of
these components has improved at a slower rate than CPU technology.
Caches can improve the average speed of memory systems by keeping the
most commonly used data in a fast memory close to the processor. One more
factor obstructing CPU speed increases is the inherently sequential character
of Von Neumann instruction execution; methods of executing various
instructions concurrently are now being developed through parallel
processing architectures.
13.6.1 Organisation and operation of the Von Neumann architecture
As mentioned in section 13.6, the core of a computer system with Von
Neumann architecture is the CPU. This element obtains (i.e., reads)
instructions and data from the main memory and coordinates the complete
execution of every instruction. It is usually structured into two subunits: the
Arithmetic and Logic Unit (ALU) and the control unit. Figure 13.2 shows the
basic components of a Von Neumann model.
The ALU merges and converts data using arithmetic operations, such as
addition, subtraction, multiplication, and division, and logical operations, such
as bit-wise negation, AND, and OR.
The control unit interprets the instructions retrieved from the memory and
manages the operation of the whole system. It establishes the sequence in
which instructions are carried out and offers all of the electrical signals
essential to manage the operation of the ALU and the interfaces to the other
system components.
The memory is a set of storage cells, each of which can be in one of two
different states. One state signifies a value of "0", and the other signifies a
value of "1". By distinguishing these two logical states, each cell is capable of
storing a single binary digit, or bit, of information. These bit storage cells are
logically arranged into words, each of which is b bits wide. Every word is
allotted a unique address in the range [0, ..., N-1].
The CPU identifies the word that it requires to read or write by storing its
address in a special memory address register (MAR); a register temporarily
stores a value within the CPU. The memory responds to a read request by
reading the value stored at the desired address and transferring it to the CPU
via the CPU-memory data bus. The value is then temporarily stored in the
memory buffer register (MBR) (also sometimes called the memory data
register) before it is used by the control unit or ALU. For a write operation, the
CPU stores the value it wishes to write into the MBR and the corresponding
address in the MAR; the memory then copies the value from the MBR into the
address pointed to by the MAR.
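A minimal sketch in C of the MAR/MBR protocol just described, modelling memory as an array of N words; all the names (mar, mbr, mem_read, mem_write) are illustrative, not part of any real instruction set.

#include <stdint.h>

#define N 1024

static uint32_t memory[N];   /* N words of storage cells       */
static uint32_t mar;         /* memory address register        */
static uint32_t mbr;         /* memory buffer (data) register  */

/* Read: the CPU places the address in the MAR; the memory returns
   the word at that address into the MBR. */
void mem_read(void)  { mbr = memory[mar]; }

/* Write: the CPU places the address in the MAR and the value in the
   MBR; the memory copies the MBR into the addressed word. */
void mem_write(void) { memory[mar] = mbr; }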
Finally, the input/output (I/O) devices connect the computer system with the
exterior world. These devices let programs and data be entered into the
system and give the system a way to control output devices. Each I/O port
has a distinctive address to which the CPU can either read or write a value,
and from the CPU's point of view, an I/O device is accessed in much the
same way as memory. In fact, in a number of systems the hardware makes
the I/O devices appear to the CPU as memory locations. This configuration,
in which the CPU sees no difference between memory and I/O devices, is
referred to as memory-mapped I/O; in this case, no distinct I/O instructions
are necessary.
13.6.2 Key features
In their basic organisation, processors with Von Neumann architecture are
distinguished from simple pre-programmed (or hardwired) controllers by
several key features. First, the same main memory stores both instructions
and data, so instructions and data are not distinguished; nor are different
types of data, such as a floating-point value, a character code, or an integer
value. The interpretation of a particular bit pattern depends entirely on how
the CPU uses it, and the same value stored at a particular memory location
can be interpreted as an instruction at one time and as data at another. For
example, when a compiler executes, it reads the source code of a program
written in a high-level language, such as FORTRAN or COBOL, and converts
it into a series of instructions that can be executed by the CPU. The memory
stores the output of the compiler like any other type of data, but that output
can also be executed by the CPU simply by interpreting it as instructions.
Thus, the same values accumulated in memory are treated as data by the
compiler but are then taken as executable instructions by the CPU. Another
outcome of this principle is that every instruction must indicate how it
interprets the data upon which it operates. Therefore, a Von Neumann
architecture will have, for instance, one set of arithmetic instructions for
operating on integer values and another set for operating on floating-point
values.
The second chief feature is that memory is accessed by name (i.e., address)
irrespective of the bit pattern stored at each address. Due to this feature,
values stored in memory can be interpreted as addresses, data, or
instructions, so programs can manipulate addresses using the same set of
instructions that the CPU uses to manipulate data. This flexibility in how
values in memory are read permits very complex, dynamically changing
patterns to be produced by the CPU to access any variety of data structure,
regardless of the kind of value being read or written. Finally, an additional
chief feature of the Von Neumann scheme is that a program performs its
instructions in sequence, unless that order is explicitly altered. The program
counter (PC), a special register in the CPU, carries the address of the next
instruction in memory to be performed. After each instruction is carried out,
the value in the PC is incremented to point to the next instruction in the
sequence. This sequential execution order can be changed by the program
with the help of branch instructions, which store a fresh value into the PC
register.
On the other hand, special hardware can sense some outside event, such as
an interrupt, and load a fresh value into the PC to cause the CPU to
commence executing a new series of instructions. Though this concept of
executing one action at a time simplifies the writing of programs and the
design and operation of the CPU, it also limits the potential performance of
this architecture.
Self Assessment Questions
9. The instruction set together with the resources needed for their execution
Activity 2:
Surf the internet to find out details about architecture called Harvard
architecture and compares it with Von Neumann architecture.
defined functionality.
Data channels provide the sole mechanism by which nodes can interact and
communicate with each other by ensuring lower coupling and greater
reusability. Data channels can also be implemented transparently between
processors to carry messages between components that are physically
distributed. In the dataflow architecture, a control application is composed of
function bodies and data channels, and the connections between function
bodies and data channels are described in a dataflow graph. Consequently,
designing a control application mainly involves constructing such a dataflow
graph by selecting function bodies from the design library and connecting them
together.
Additional user-defined or application-specific function bodies are also easily
supported. A model of dataflow programming is shown in figure 13.3.
Some models were extended with "sticky tokens": tokens that remain in place
much like a constant input and match tokens arriving on the other inputs.
Nodes can have varying granularity, from single instructions to whole
functions. Once a node is activated and its operation performed (the node is
said to have "fired"), the results are passed along the output arcs to waiting
nodes. This process is repeated until all of the nodes have fired and the final
result is created; more than one node can fire simultaneously. Arithmetic
operators and conditional operators act as nodes.
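For example, to evaluate x = (a + b) * (c - d), the addition node and the subtraction node each fire as soon as their input tokens arrive; since neither depends on the other, they may fire simultaneously. The multiplication node fires only when both result tokens have arrived on its input arcs, producing the final result x.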
The Dynamic Critical Path: The dynamic critical path of a dataflow graph is
simultaneously a function of program dependences, the runtime execution
path, hardware resources, and dynamic scheduling. All critical events must
be last-arrival events; such an event is the last one which enables data to be
latched. Events correspond to signal transitions on the edges of the dataflow
graph. Most often, the last-arrival event is the last input to reach an operation;
however, for some operations the last-arrival event is the input that enables
the computation of the output. In lenient execution, all forward branches are
executed simultaneously.
• *T (Star-T): The Monsoon project at MIT was followed by the Star-T
project. It used extensions of an off-the-shelf processor architecture to
define a multiprocessor architecture that supports fine-grain
communication and user micro-threads. The architecture is intended to
retain the latency-hiding feature of Monsoon's split-phase global memory
operations while remaining compatible with conventional Massively
Parallel Architectures (MPAs) based on the Von Neumann model.
Inter-node traffic consists of a tag (called a continuation: a pair
comprising a context and an instruction pointer).
13.9 Summary
• Multithreading is needed to create an application that is able to perform
more than one task at once.
• There are various needs, benefits and principles of multithreading.
• Scalability is defined as the ability to increase the amount of processing
that can be done by adding more resources to a system.
• The combination of languages and the computer architecture in a common
foundation or paradigm is called Computational Model.
• There are basically three types of computational models as follows: Von
Neumann model, Dataflow model and Hybrid multithreaded model.
• The heart of the Von Neumann computer architecture is the Central
Processing Unit (CPU), consisting of the control unit and the ALU
(Arithmetic and Logic Unit).
• Dataflow architecture does not have a program counter and the execution
of instructions is solely determined based on the availability of input
arguments to the instructions.
• A hybrid model synergistically combines features of both Von Neumann
and Data-flow, as well as exposes parallelism at a desired level.
13.10 Glossary
• Background processing: Another name for the multithreading system
which has a concept of priorities.
• Communication parallelism: refers to the way threads can
communicate with other threads residing in other processors.
• Computation parallelism: refers to the 'conventional' parallelism
• GUI: Graphical User Interface, these programs can perform more than one
task (such as editing a document as well as printing it) at a time.
• JVM: Java Virtual Machine, it is an example of multithreading system.
13.12 Answers
Self Assessment Questions
1. Background processing, pre-emptive multitasking
2. False
3. True
4. Fine-grain threading
5. Scalable
6. Thread
7. Computational Model
8. True
9. Instruction set architecture (ISA)
10. True
11. Node, arrow
12. Fired Results
13. Iannucci
14. False
Terminal Questions
1. Multithreading is a type of multitasking. The multithreading system has a
concept of priorities and therefore, is also called background processing
or pre-emptive multitasking. Refer Section 13.2.