Computer Performance
UNIVERSITY OF BAHRI
COLLEGE OF ENGINEERING AND ARCHITECTURE
Presented by:
Group A
Supervisor:
Parallel processing
Parallel processing has emerged as a key technology in modern computers,
driven by the increasing demand for higher performance, lower cost, and
higher productivity in real-life applications.
Concurrent events take place in today's high-performance computers due
to the common practice of multiprogramming, multiprocessing, or
multicomputing:
Batch processing
Multiprogramming
Time sharing
Multiprocessing
Definition
Parallel processing can be performed at four programmatic levels:
Job or program level
The highest level of parallel processing is conducted among multiple jobs or
programs.
Task or procedure level
This is the next highest level of parallel processing and is conducted among
procedures or tasks (program segments) within the same program. It involves
the decomposition of a program into multiple tasks.
Instruction level
The third level exploits concurrency among multiple instructions.
Intrainstruction level
Finally, we may wish to have faster and concurrent operations within each
instruction.
Parallelism in a Uniprocessor System
Most general-purpose uniprocessor systems have the same basic structure.
Fig. 1 shows the architectural components of the superminicomputer VAX-
11/780, manufactured by Digital Equipment Corporation. The CPU contains
the master controller of the VAX system. There are sixteen 32-bit general-
purpose registers, one of which serves as the program counter (PC). There is
also a special CPU status register containing information about the current
state of the processor and of the program being executed. The CPU contains
an arithmetic and logic unit (ALU) with an optional floating-point accelerator,
and some local cache memory with an optional diagnostic memory. The
operator can intervene in the CPU through the console, which is connected to
a floppy disk.
The CPU, the main memory (2^32 words of 32 bits each), and the I/O
subsystems are all connected to a common bus, the synchronous backplane
interconnect (SBI). Through this bus, all I/O devices can communicate with
each other, with the CPU, or with the memory. Peripheral storage or I/O
devices can be connected directly to the SBI through the unibus and its
controller (which can be connected to PDP-11 series minicomputers), or
through a massbus and its controller.
1- Multiplicity of functional units
Early computers had only one arithmetic and logic unit in the CPU.
Furthermore, the ALU could perform only one function at a time, a rather
slow process for executing a long sequence of arithmetic and logic
instructions. In practice, many of the functions of the ALU can be distributed
to multiple specialized functional units which operate in parallel. The
CDC-6600 (designed in 1964) has 10 functional units built into its CPU
(Fig. 2). These 10 units are independent of each other and may operate
simultaneously. A scoreboard is used to keep track of the availability of the
functional units and the registers being demanded. With 10 functional units
and 24 registers available, the instruction issue rate can be significantly
increased.
Almost all modern computers and attached processors are equipped with
multiple functional units to perform parallel or simultaneous arithmetic and
logic operations. This practice of functional specialization and distribution
can be extended to array processors and multiprocessors.
2- Parallelism and pipelining within the CPU
Parallel adders, using such techniques as carry-lookahead and carry-save, are
now built into almost all ALUs. This is in contrast to the bit-serial adders
used in first-generation machines. High-speed multiplier recoding and
convergence division are techniques for exploiting parallelism and the
sharing of hardware resources for the functions of multiply and divide. The
use of multiple functional units is a form of parallelism within the CPU.
3- Use of a hierarchical memory system
Usually, the CPU is about 1000 times faster than memory access. A
hierarchical memory system can be used to close up the speed gap. Computer
memory hierarchy is conceptually illustrated in Fig. 3. The innermost level is
the register file, directly addressable by the ALU. Cache memory can be used
to serve as a buffer between the CPU and the main memory. Block access of
the main memory can be achieved through multiway interleaving across
parallel memory modules. Virtual memory space can be established with the
use of disks and tape units at the outer levels.
Even when there is only one CPU in a uniprocessor system, we can still
achieve a high degree of resource sharing among many user programs. We
will briefly review the concepts of multiprogramming and time sharing in this
subsection. These are software approaches to achieve concurrency in a
uniprocessor system.
Classification of computers by parallelism
In this section, computers are classified into four architectural categories
based on the multiplicity of instruction streams and data streams in a
computer system (Flynn's classification).
The SIMD (single instruction stream, multiple data stream) class corresponds
to array processors. As illustrated in Fig. 4b, there are multiple processing
elements (PEs) supervised by the same control unit. All PEs receive the same
instruction broadcast from the control unit but operate on different data sets
from distinct data streams. The shared memory subsystem may contain
multiple modules.
Fig. 4 Flynn's classification of various computer organizations.
Introduction to performance models
Designers evaluate the performance of computers and the effect of design
changes to a computer or subsystem.
Computer Performance
For a computer or a computer subsystem, when it comes to measures of
performance that can be used by a designer in making design choices, we are
generally interested in two things: the time to perform given tasks and the
rate at which given tasks are performed.
Note that for a common workload (Task A), time and rate have a reciprocal
relationship; for example, a task completed in 2 s is performed at a rate of
0.5 tasks per second. Because a computer has a clock that controls all of its
functions, the number of clocks is frequently used as a measure of time.
Factors affecting the performance of a computer
1- Clock rate
The clock rate determines the duration of one clock cycle; for a given number
of clocks, execution time is inversely proportional to the clock rate.
2- Program size (instruction count, IC)
A basic time model that is widely used in evaluating processors is clocks per
instruction (CPI), which is the number of clock cycles required to execute an
average instruction.
For example, a task that takes 1 × 10^6 clocks to execute 5 × 10^5 instructions
has a CPI of 2.0. Small values of CPI indicate higher performance than large
values of CPI. For many processors, different instructions require a different
number of clocks for execution; thus CPI is an average value for all
instructions executed. Further, different programs use instructions in
different mixes. CPI combines the weighted use of each instruction class (the
fraction of the total instructions) with the number of clocks per instruction
class to give a weighted mean:

CPI = Σ_i (W_i × CPI_i)

where W_i is the fraction of the total instructions that belong to class i and
CPI_i is the number of clocks per instruction for class i.
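As a quick check of these definitions, here is a minimal Python sketch
(function names and the sample mix are ours, not from the slides) that
computes a simple CPI and a weighted-mean CPI:

# Minimal sketch of the CPI definitions above (names are ours).

def cpi(total_clocks, instruction_count):
    """CPI = total clock cycles / instructions executed."""
    return total_clocks / instruction_count

def weighted_cpi(mix):
    """Weighted-mean CPI = sum of W_i * CPI_i over instruction classes.
    mix maps class name -> (fraction of instructions W_i, clocks CPI_i)."""
    assert abs(sum(w for w, _ in mix.values()) - 1.0) < 1e-9
    return sum(w * c for w, c in mix.values())

# Worked example from the text: 1e6 clocks for 5e5 instructions -> CPI = 2.0
print(cpi(1e6, 5e5))  # 2.0

mix = {"alu": (0.5, 1), "mem": (0.3, 3), "branch": (0.2, 2)}  # HYPOTHETICAL mix
print(weighted_cpi(mix))  # 1.8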
Explanation
Consider a processor with a five-stage instruction pipeline. Each stage
requires one clock cycle, and an instruction passes through the stages
sequentially. Without pipelining, a new instruction is fetched in stage 1 only
after the previous instruction finishes at stage 5; therefore the number of
clock cycles it takes to execute an instruction is five (CPI = 5 > 1), and the
processor is said to be subscalar. With pipelining, a new instruction is fetched
every clock cycle by exploiting instruction-level parallelism. Since one could
theoretically have five instructions in the five pipeline stages at once (one
instruction per stage), a different instruction completes stage 5 in every clock
cycle, and on average the number of clock cycles it takes to execute an
instruction is 1 (CPI = 1). In this case, the processor is said to be scalar.
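The arithmetic behind CPI = 5 versus CPI = 1 can be sketched as follows,
under the usual idealized assumption of no stalls or hazards:

# Idealized k-stage pipeline timing (a sketch; assumes no stalls or hazards).

def pipeline_cycles(n_instructions, n_stages):
    """Cycles to run n instructions with pipelining: fill + drain."""
    return n_stages + (n_instructions - 1)

def effective_cpi(n_instructions, n_stages):
    return pipeline_cycles(n_instructions, n_stages) / n_instructions

print(effective_cpi(5, 5))       # 1.8 for a very short run
print(effective_cpi(100000, 5))  # ~1.0: CPI approaches 1 once the pipeline stays full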
Example
Suppose the instruction classes of a processor require the following cycle
counts:
Load (5 cycles)
Store (4 cycles)
R-type (4 cycles)
Branch (3 cycles)
Jump (3 cycles)
If a program has a given mix of these instruction classes, the effective CPI is
the weighted mean of these cycle counts, as in the sketch below.
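The slide's actual instruction mix did not survive extraction, so the following
sketch uses a made-up mix (the percentages are our assumption, purely for
illustration) together with the cycle counts listed above:

# Weighted-mean CPI for the listed cycle counts.
# The instruction-mix fractions below are HYPOTHETICAL (the slide's mix was lost).

cycles = {"load": 5, "store": 4, "r-type": 4, "branch": 3, "jump": 3}
mix = {"load": 0.30, "store": 0.10, "r-type": 0.40, "branch": 0.15, "jump": 0.05}

assert abs(sum(mix.values()) - 1.0) < 1e-9
cpi = sum(mix[k] * cycles[k] for k in cycles)
print(cpi)  # 4.1 with this assumed mix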
Processor speed is often measured in terms of MIPS (millions of instructions
per second). MIPS can be useful when comparing performance between
processors made with similar architectures (e.g., Microchip-branded
microcontrollers), but MIPS figures are difficult to compare between differing
CPU architectures. MIPS is defined as
MIPS = instruction count / (execution time × 10^6)

Since execution time = (instruction count × CPI) / clock rate, then

MIPS = clock rate / (CPI × 10^6)
Example
[3] A 400-MHz processor was used to execute a benchmark program with a
given instruction mix and clock cycle count per instruction class. Determine
the effective CPI, MIPS rate, and execution time for this program.
Solution
The effective CPI is the weighted arithmetic mean over the instruction
classes; the MIPS rate and execution time then follow from the formulas
above, as in the sketch below.
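Since the benchmark's instruction-mix table is not given here, the following
sketch uses made-up numbers (the mix and per-class CPIs are our
assumptions) purely to demonstrate the method:

# Effective CPI, MIPS, and execution time for a 400-MHz processor.
# The instruction mix and per-class CPIs below are HYPOTHETICAL (the original
# benchmark table was lost); only the method is taken from the text.

clock_rate = 400e6  # 400 MHz

# class: (instruction count, clocks per instruction)
benchmark = {"alu": (45000, 1), "load/store": (32000, 2), "branch": (8000, 2)}

ic = sum(n for n, _ in benchmark.values())          # total instructions
clocks = sum(n * c for n, c in benchmark.values())  # total cycles
cpi = clocks / ic                                   # effective CPI
mips = clock_rate / (cpi * 1e6)                     # MIPS rate
exec_time = ic * cpi / clock_rate                   # seconds

print(f"CPI = {cpi:.2f}, MIPS = {mips:.1f}, time = {exec_time * 1e6:.1f} us")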
Sometimes MIPS can fail to give a true picture of performance in that it does
not track execution time. So another popular measure, millions of floating-
point operations per second (MFLOPS), is used. The formula for MFLOPS is
simply

MFLOPS = number of floating-point operations / (execution time × 10^6)
Speedup
Designers are faced with the question of evaluating the effect of modifying a
design, called design A, into design B. Is the modified design, design B, better
than design A? The answer is found by using the concept of speedup. Note
that speedup is a dimensionless ratio:

speedup = time(design A) / time(design B)

There are three types of means used to find the central tendency of
measurements: the arithmetic mean for times, the harmonic mean for rates,
and the geometric mean for ratios. The measurements and observations for
these means may have equal weights or be weighted.
Time-Based Means
Smith (Smith 1988) states that "the time required to perform a specific amount
of computation is the ultimate measure of computer performance." Thus time
measurements are usually fundamental measurements in the field of
computer performance modeling. When other measures are used, the validity
of these measures can usually be checked by converting them to time.
Arithmetic Mean
The arithmetic mean is used to find the central tendency of equal-weight time
measurements. The arithmetic mean of the time per event is determined by

T̄ = (1/n) × Σ_{i=1..n} T_i

An arithmetic mean requires that the data points have equal weights, and the
result is frequently called the average. For example, the average grade, x̄, in a
class is the sum of the observed grades divided by the number of students.
Example
A given job is run on the corporate computer once a month; the measured
run times for four consecutive months are 2.0, 2.2, 1.9, and 2.3 hours. What is
the mean time to run the job?
Solution
The data points have equal weight, as the same job is run each month. Thus
the arithmetic mean or average is used to find the central tendency. The
mean or average time to run a job over the 4-month period is

T̄ = (2.0 + 2.2 + 1.9 + 2.3) / 4 = 2.1 h per job
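A one-line check of this average in code (the times are those reconstructed
from the jobs-per-hour rates quoted later in these notes):

# Arithmetic mean of equal-weight time measurements (hours per job).
times = [2.0, 2.2, 1.9, 2.3]
mean_time = sum(times) / len(times)
print(mean_time)  # 2.1 hours per job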
Weighted Arithmetic Mean
The weighted arithmetic mean is the central tendency of time per unit of
work:

T̄_w = Σ_{i=1..n} W_i × T_i

where W_i is the fraction that operation i is of the total operations and T_i is
the time consumed by each use. Note that W_1 + W_2 + … + W_n = 1 and
that W_i is not the fraction of time that the operation is in use.
Example
Solution
The observations are in time and are weighted. Thus the CPI of the processor
is determined by the weighted arithmetic mean:
Note: when solving a problem such as this one, add the event probabilities
together and verify that the sum is one; if the sum is not equal to one, there is
some error in the solution. A good practice is to use a table, as shown in the
table below, for the solution of these problems rather than attempting to bind
variables to an equation.
Rate-Based Means
Performance is sometimes measured as rates. For example, a car goes 25 MPH
or a computer performs 100 million instructions per second. Likewise, a
computer may execute 0.5 IPC, the reciprocal of CPI. Thus, instead of time, we
can also consider rates for evaluating performance or design changes. When
rates are the observed events, the harmonic mean and the weighted harmonic
mean will provide the central tendency of the observations.
Harmonic Mean
The harmonic mean is the central tendency of observations expressed as rates
having equal weights, and the result is the mean events per unit of time:

H = n / Σ_{i=1..n} (1/R_i)

That is, the harmonic mean is the reciprocal of the arithmetic mean of the
reciprocals of the values, and it is the central tendency of units of work per
unit of time.
Example
Consider the example in the subsection on arithmetic means of jobs being run
on the corporate computer. We express the observations in a rate measure of
jobs per hour. These data are 0.5, 0.45, 0.53, and 0.43 jobs per hour. What is
the central tendency of these measurements in jobs per hour?
Solution

H = 4 / (1/0.5 + 1/0.45 + 1/0.53 + 1/0.43) ≈ 0.47 jobs per hour

(0.476 before the rates were rounded to two digits). Note that 0.476 jobs per
hour is the reciprocal of the 2.1 h per job found previously with the
arithmetic mean.
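The same check in code (the small difference from 0.476 comes from the
quoted rates being rounded):

# Harmonic mean of equal-weight rates (jobs per hour).
rates = [0.5, 0.45, 0.53, 0.43]
h = len(rates) / sum(1.0 / r for r in rates)
print(h)  # ~0.474; ~0.476 with unrounded rates, the reciprocal of ~2.1 h/job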
Weighted Harmonic Mean
When the observed rate data are weighted, the mean becomes

R̄ = 1 / Σ_{i=1..n} (W_i / R_i)

where W_i is the fraction of the total task, not time, that is performed at rate
R_i. The result is the central tendency in weighted events per unit of time.
Example
Solution
Because the observations are rates and the executions are weighted, we use
the weighted harmonic mean to find the central tendency, as shown in the
table below.
Example
Solve the same problem by using times in seconds per instruction, rather than
rates.
Solution
Because the data are in time and the executions are weighted, we use the
weighted arithmetic mean, as shown in the table below.
Note: the two results have a reciprocal relationship, as expected.
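A sketch of both weighted means side by side (the weights and rates here are
our own illustrative numbers, since the slide's table was lost), showing the
reciprocal relationship:

# Weighted harmonic mean of rates vs. weighted arithmetic mean of times.
# The weights and rates are HYPOTHETICAL illustration values.

weights = [0.6, 0.3, 0.1]  # fractions of the total work, sum to 1
rates = [2.0, 1.0, 0.5]    # events per unit time for each part

whm = 1.0 / sum(w / r for w, r in zip(weights, rates))  # events per unit time
times = [1.0 / r for r in rates]                        # time per event
wam = sum(w * t for w, t in zip(weights, times))        # time per event

print(whm, wam, whm * wam)  # the product is 1.0: reciprocal relationship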
AMDAHL'S LAW
This law models the speedup of a computer when it is processing two classes
of tasks: one class can be speeded up, whereas the other class cannot. This
situation is frequently encountered in computer system design. The solution
of Amdahl's law model is normalized to the execution time of the system
before any speedup is applied.
Assume that a program has two components, t1 and t2, of which only t2 can
be speeded up. The overall speedup of the system is

s(n) = T_s / (a × T_s + (1 − a) × T_s / n)

where
T_s = execution time for sequential processing of the whole task on one
processor,
a = the fraction of the task that cannot be speeded up, and
n = the factor by which the remaining fraction (1 − a) is speeded up.

Cancelling T_s gives

s(n) = 1 / (a + (1 − a)/n)
We consider two limits of this expression.
As a → 0 (in other words, the complete program can be speeded up and there
is no sequential load), the speedup s(n) → n.
As n → ∞, the speedup s(n) → 1/a. This second limit tells us that if 10% of
the time of the original system cannot be speeded up, then the greatest
speedup possible for the system is 10.
Example
An executing program is timed, and it is found that the serial portion (the
portion that cannot be speeded up) consumes 30 s, whereas the other portion,
which can be speeded up, consumes 70 s. You believe that by using parallel
processors you can speed up this latter portion by a factor of 8. What is the
speedup of the system?
Solution
A graphical solution to this Amdahl's law problem is shown in the figure
below. The 70-s portion of the time is decreased by a factor of 8 to 8.75 s. The
speedup of the system is found by dividing the original time (100 s) by the
enhanced time (30 + 8.75 = 38.75 s). Thus the speedup is 2.58.
The same result can be found by binding the arguments to Amdahl's law: for
the original system a = 0.3 and the enhancement is n = 8, so the speedup is
calculated as

s(8) = 1 / (0.3 + 0.7/8) = 1 / 0.3875 ≈ 2.58
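The same computation as a small reusable sketch:

# Amdahl's law: s(n) = 1 / (a + (1 - a)/n),
# where a is the fraction that cannot be speeded up.

def amdahl_speedup(a, n):
    return 1.0 / (a + (1.0 - a) / n)

print(amdahl_speedup(0.3, 8))    # ~2.58, the worked example above
print(amdahl_speedup(0.1, 1e9))  # ~10: the n -> infinity limit for a = 0.1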
Relative vector/scalar performance
Let W be the total amount of work done in a computer, of which

W = W_s + W_v

where W_s is the work executed in scalar mode and W_v is the work that can
be executed in vector mode. Let R_s and R_v be the scalar and vector
execution rates, respectively, and let r = R_v / R_s be the vector/scalar speed
ratio.
If only scalar execution is used (vector mode not used at all), then the
execution time is

T_s = (W_s + W_v) / R_s

If vector mode is used for W_v, the execution time becomes

T = W_s / R_s + W_v / R_v

so the relative vector/scalar performance (speedup) is

T_s / T = (W_s + W_v) / (W_s + W_v / r)
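A sketch of this ratio in code (the work split and speed ratio below are
hypothetical illustration values, not from the slides):

# Relative vector/scalar performance: (Ws + Wv) / (Ws + Wv / r).
# The work split and speed ratio below are HYPOTHETICAL illustration values.

def vector_speedup(ws, wv, r):
    return (ws + wv) / (ws + wv / r)

# e.g. 20% scalar work, 80% vectorizable work, vector unit 10x the scalar rate:
print(vector_speedup(0.2, 0.8, 10.0))  # ~3.57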
Interleaved Memory Organization
In order to close up the speed gap between the CPU/cache and the main
memory built with RAM modules, an interleaving technique is presented
below which allows pipelined access of the parallel memory modules.
Memory modules
The main memory is built with multiple modules. These memory modules
are connected to a system bus or a switching network to which other
resources such as processors or I/O devices are also connected.
Once presented with a memory address, each memory module returns one
word per cycle. It is possible to present different addresses to different
memory modules so that parallel access of multiple words can be done
simultaneously or in a pipelined fashion. Both parallel access and pipelined
access are forms of parallelism practiced in a parallel memory organization.
Consider a main memory built with m = 2^a modules, each containing
w = 2^b words. The total memory capacity is m × w = 2^(a+b) words. These
memory words are assigned linear addresses.
Memory interleaving
There are two address formats for memory interleaving: high-order
interleaving and low-order interleaving.
High-order interleaving (Fig. 5b) uses the high-order a bits as the module
address and the low-order b bits as the word address within each module.
Contiguous memory locations are thus assigned to the same memory
module. In each memory cycle, only one word is accessed from each module.
Thus high-order interleaving cannot support block access of contiguous
locations.
Fig. 5 Two interleaved memory organizations with m = 2^a modules.
On the other hand, low-order m-way interleaving (Fig. 5a) uses the low-order
a bits as the module address, spreading contiguous memory locations across
successive modules, and therefore does support block access in a pipelined
fashion (the sketch below contrasts the two mappings).
Unless otherwise specified, we consider only low-order memory interleaving
in subsequent discussions.
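A small sketch of the two address mappings, assuming m = 2^a modules of
w = 2^b words (the function names are ours):

# Address decomposition for high-order vs. low-order interleaving,
# with m = 2**a modules of w = 2**b words each.

def high_order(addr, a, b):
    """Returns (module, word offset): high-order a bits pick the module."""
    return addr >> b, addr & ((1 << b) - 1)

def low_order(addr, a, b):
    """Returns (module, word offset): low-order a bits pick the module."""
    return addr & ((1 << a) - 1), addr >> a

# Eight contiguous addresses with m = 8 (a = 3), w = 16 (b = 4):
for addr in range(8):
    print(addr, high_order(addr, 3, 4), low_order(addr, 3, 4))
# High-order: all eight land in module 0 (no block access).
# Low-order: one word per module (pipelined block access possible).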
S-Access Memory Organization
The low-order interleaved memory can be rearranged to allow simultaneous
access, or S-access, as illustrated in Fig. 6a. In this case, all memory modules
are accessed simultaneously in a synchronized manner. Again, the high-order
(n − a) bits select the same offset word from each module.
At the end of each memory cycle (Fig. 6), m = 2^a consecutive words are
latched in the data buffers simultaneously. The low-order a bits are then used
to multiplex the m words out, one per minor cycle.
If the minor cycle is chosen to be 1/m of the major memory cycle, then it
takes two memory cycles to access m consecutive words.
However, if the access phase of the last access is overlapped with the fetch
phase of the current access (Fig. 6b), effectively m words take only one
memory cycle to access. If the stride is greater than 1, the throughput
decreases, roughly in proportion to the stride.
Fig. 6 The S-access interleaved memory for vector operand access
Fig. 7 Multiway interleaved memory organization and the C-access timing
chart.

t = θ / m
where m is the degree of interleaving. The timing of the pipelined access of
eight contiguous memory words is shown in Fig. 7. This type of concurrent
access of contiguous words has been called a C-access memory scheme. The
major cycle θ is the total time required to complete the access of a single
word from a module. The minor cycle t is the actual time needed to produce
one word, assuming overlapped access of successive memory modules
separated by one minor cycle t.
Note that the pipelined access of the block of eight contiguous words is
sandwiched between other pipelined block accesses before and after the
present block. Even though the total block access time is 2θ, the effective
access time of each word is reduced to t as the memory is contiguously
accessed in a pipelined fashion.
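A sketch of this timing arithmetic, in our formulation: the first word of a
block costs a full major cycle θ, and each further word emerges one minor
cycle t = θ/m later:

# C-access timing: time to access a block of k contiguous words,
# m-way interleaved (our formulation of the text's 2-theta observation).

def block_access_time(theta, m, k):
    t = theta / m
    return theta + (k - 1) * t

theta, m = 8.0, 8        # e.g. theta = 8 time units, eight-way interleaving
total = block_access_time(theta, m, 8)
print(total, total / 8)  # 15.0 total (~2*theta); per-word time tends to t = 1
                         # as successive blocks pipeline back to back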
Memory Bandwidth
The memory bandwidth B of an m-way interleaved memory is upper-
bounded by m and lower-bounded by 1. The Hellerman estimate of B is

B ≈ m^0.56 ≈ √m

This pessimistic estimate is due to the fact that block access of various
lengths and access of single words are randomly mixed in user programs.
Hellerman's estimate was based on a single-processor system.
In a vector processing computer, the access time of a long vector with n
elements and stride distance 1 has been estimated by Cragon (1992) as
follows. It is assumed that the n elements are stored in contiguous memory
locations in an m-way interleaved memory system. The average time t_a
required to access one element in a vector is estimated by

t_a = (θ/m) × (1 + (m − 1)/n)

When n → ∞ (a very long vector), t_a → θ/m = t; when n = 1 (a scalar
access), t_a → θ.
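The two limits are easy to check numerically (θ and m below are illustrative
values of our choosing):

# Cragon's estimate of average per-element access time for a stride-1 vector
# in an m-way interleaved memory: t_a = (theta/m) * (1 + (m - 1)/n).

def cragon_access_time(theta, m, n):
    return (theta / m) * (1.0 + (m - 1) / n)

theta, m = 8.0, 8
print(cragon_access_time(theta, m, 1))       # 8.0  = theta   (scalar access)
print(cragon_access_time(theta, m, 10_000))  # ~1.0 = theta/m (long vector)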
Fault Tolerance
Fig. 8 Bandwidth analysis of two interleaved memory organizations over
eight memory modules (absolute address shown in each memory bank).
In the four-way interleaved design of Fig. 8a, the bandwidth of the gracefully
degraded memory system is reduced to four words per memory cycle
because only one of the two faulty banks is abandoned.
In the two-way design in Fig. 8b, the gracefully degraded memory system
may still have three working memory banks; thus a maximum bandwidth of
six words is expected. The higher the degree of interleaving, the higher the
potential memory bandwidth if the system is fault-free.
References
[1] Kai Hwang and Faye A. Briggs, Computer Architecture and Parallel
Processing, McGraw-Hill (1984).
[3] https://en.wikipedia.org/wiki/Cycles_per_instruction