
In the name of Allah, the Most Gracious, the Most Merciful

UNIVERSITY OF BAHRI
COLLEGE OF ENGINEERING AND ARCHITECTURE

ELECTRICAL ENGINEERING (Control)


5th year – 9th semester

Parallelism & Computer Performance

Presented by:

Group A

Supervisor:

Dr. Zeinab Mahmoud

1st December 2016


Group members
Rayyan mohammed siddig
Suha Mohammed Saad
Nusiba Ahmed Mahmmed
Shayma Ali Abdallah
Gund Alyemen Merghani
Mona Alsir Ibrahim Gabir
Nusiuba Ibrahim Mohammed
Abdallah Ibrahim Abdallah
Abdalrahman ahmed Osman
Aatif Osman El tahir Bkr

 Parallel processing
Parallel processing has emerged as a key technology in modern computers, driven by the increasing demand for higher performance, lower costs, and sustained productivity in real-life applications.

Concurrent events are taking place in today's high-performance computers due to the common practice of multiprogramming, multiprocessing, or multicomputing.

From an operating system point of view, computer systems have improved chronologically in four phases:

 Batch processing
 Multiprogramming
 Time sharing
 Multiprocessing

In these four operating modes, the degree of parallelism increases sharply from phase to phase. The general trend is to emphasize parallel processing of information. In what follows, the term information is used with an extended meaning to include data, information, knowledge, and intelligence. We formally define parallel processing as follows:

Definition

[1] Parallel processing is an efficient form of information processing which emphasizes the exploitation of concurrent events in the computing process.

Concurrency implies parallelism, simultaneity, and pipelining. Parallel events may occur in multiple resources during the same time interval; simultaneous events may occur at the same time instant; and pipelined events may occur in overlapped time spans. These concurrent events are attainable in a computer system at various processing levels. Parallel processing demands concurrent execution of many programs in the computer. It is in contrast to sequential processing. It is a cost-effective means to improve system performance through concurrent activities in the computer.

Parallel processing can be performed in four programmatic levels:

 Job or program level

This is the highest level of parallel processing. It is conducted among multiple jobs or programs through multiprogramming, time sharing, and multiprocessing. This level requires the development of parallel processable algorithms. The implementation of parallel algorithms depends on the efficient allocation of limited hardware-software resources to multiple programs being used to solve a large computation problem.

 Task or procedure level

This is the next highest level of parallel processing and is conducted among procedures or tasks (program segments) within the same program. It involves the decomposition of a program into multiple tasks.

 Instruction level

The third level exploits concurrency among multiple instructions. Data dependency analysis is often performed to reveal parallelism among instructions. Vectorization may be desired among scalar operations within DO loops.

 Intrainstruction level

Finally, we may wish to have faster and concurrent operations within each
instruction.

The highest job level is often conducted algorithmically. The lowest intrainstruction level is often implemented directly by hardware means. Hardware roles increase from high to low levels. On the other hand, software implementations increase from low to high levels. The trade-off between hardware and software approaches to solving a problem is always a controversial issue. As hardware cost declines and software cost increases, more and more hardware methods are replacing conventional software approaches. The trend is also supported by the increasing demand for faster real-time, resource-sharing, and fault-tolerant computing environments.

 Parallelism In Uniprocessor System
Most general-purpose uniprocessor systems have the same basic structure.

Basic Uniprocessor Architecture

A typical uniprocessor computer consists of three major components: the main memory, the central processing unit (CPU), and the input-output (I/O) subsystem.

The architecture of a commercially available uniprocessor computer is given in Fig. 1 below to show the possible interconnection of structures among the three subsystems. We will examine the major components in the CPU and in the I/O subsystem.

Fig. 1. The system architecture of the supermini VAX-11/780 uniprocessor system (Courtesy of Digital Equipment Corporation).

Fig. 1 shows the architectural components of the superminicomputer VAX-11/780, manufactured by Digital Equipment Corporation. The CPU contains the master controller of the VAX system. There are sixteen 32-bit general-purpose registers, one of which serves as the program counter (PC). There is also a special CPU status register containing information about the current state of the processor and of the program being executed. The CPU contains an arithmetic and logic unit (ALU) with an optional floating-point accelerator, and some local cache memory with an optional diagnostic memory. The operator can intervene in the CPU through the console, which is connected to a floppy disk.

The CPU, the main memory (2^32 words of 32 bits each), and the I/O subsystems are all connected to a common bus, the synchronous backplane interconnect (SBI). Through this bus, all I/O devices can communicate with each other, with the CPU, or with the memory. Peripheral storage or I/O devices can be connected directly to the SBI through the unibus and its controller (which can be connected to PDP-11 series minicomputers), or through a massbus and its controller.

 Hardware and software means to promote parallelism in uniprocessor systems

Hardware approaches emphasize resource multiplicity and time overlapping. It is necessary to balance the processing rates of the various subsystems in order to avoid bottlenecks and to increase total system throughput, which is the number of instructions (or basic computations) performed per unit time. Finally, we study operating system software approaches to achieve parallel processing with better utilization of the system resources.

Parallel Processing Mechanisms in uniprocessor Computers

A number of parallel processing mechanisms have been developed in uniprocessor computers. We identify them in the following six categories:

 Multiplicity of functional units
 Parallelism and pipelining within the CPU
 Overlapped CPU and I/O operations
 Use of a hierarchical memory system
 Balancing of subsystem bandwidths
 Multiprogramming and time sharing

1- Multiplicity of functional units

Early computers had only one arithmetic and logic unit in the CPU. Furthermore, the ALU could perform only one function at a time, a rather slow process for executing a long sequence of arithmetic and logic instructions. In practice, many of the functions of the ALU can be distributed to multiple, specialized functional units which can operate in parallel. The CDC-6600 (designed in 1964) has 10 functional units built into its CPU (Fig. 2). These 10 units are independent of each other and may operate simultaneously. A scoreboard is used to keep track of the availability of the functional units and of the registers being demanded. With 10 functional units and 24 registers available, the instruction issue rate can be significantly increased.

Fig. 2. The system architecture of the CDC-6600 computer (Courtesy of Control Data Corp.).

Another good example of a multifunction uniprocessor is the IBM 360/91 (1968), which has two parallel execution units (E units): one for fixed-point arithmetic, and the other for floating-point arithmetic. Within the floating-point E unit are two functional units: one for floating-point add-subtract and the other for floating-point multiply-divide. The IBM 360/91 is a highly pipelined, multifunction, scientific uniprocessor.

Almost all modern computers and attached processors are equipped with multiple functional units to perform parallel or simultaneous arithmetic and logic operations. This practice of functional specialization and distribution can be extended to array processors and multiprocessors.

2- Parallelism and pipelining within the CPU

Parallel adders, using such techniques as carry-lookahead and carry-save, are now built into almost all ALUs. This is in contrast to the bit-serial adders used in first-generation machines. High-speed multiplier recoding and convergence division are techniques for exploiting parallelism and the sharing of hardware resources for the multiply and divide functions. The use of multiple functional units is a form of parallelism within the CPU.

Various phases of instruction execution are now pipelined, including instruction fetch, decode, operand fetch, arithmetic-logic execution, and store result. To facilitate overlapped instruction executions through the pipe, instruction prefetch and data buffering techniques have been developed. Instruction and arithmetic pipeline designs will be covered later. Most commercial uniprocessor systems are now pipelined in their CPU, with a clock period between 10 and 500 ns.

3- Overlapped CPU and I/O operations

I/O operations can be performed simultaneously with CPU computations by using separate I/O controllers, channels, or I/O processors. The direct-memory-access (DMA) channel can be used to provide direct information transfer between the I/O devices and the main memory.

The DMA is conducted on a cycle-stealing basis, which is transparent to the CPU. Furthermore, I/O multiprocessing, such as the use of the 10 I/O processors in the CDC-6600 (Fig. 2), can speed up data transfer between the CPU (or memory) and the outside world.

Back-end database machines can be used to manage large databases stored on disks.

4- Use of hierarchical memory system

Usually, the CPU is about 1000 times faster than memory access. A hierarchical memory system can be used to close up the speed gap. The computer memory hierarchy is conceptually illustrated in Fig. 3. The innermost level is the register file, directly addressable by the ALU. Cache memory can be used to serve as a buffer between the CPU and the main memory. Block access of the main memory can be achieved through multiway interleaving across parallel memory modules. Virtual memory space can be established with the use of disks and tape units at the outer levels.

Fig. 3. The classical memory hierarchy

5- Multiprogramming and Time Sharing

Even when there is only one CPU in a uniprocessor system, we can still
achieve a high degree of resource sharing among many user programs. We
will briefly review the concepts of multiprogramming and time sharing in this
subsection. These are software approaches to achieve concurrency in a
uniprocessor system.

 Classification of computers by parallelism
Flynn's classification of computer architectures is presented in this section, based on the multiplicity of instruction streams and data streams in a computer system.

Computer organizations are characterized by the multiplicity of the hardware provided to service the instruction and data streams. Listed below are Flynn's four machine organizations:

 Single instruction stream-single data stream (SISD)
 Single instruction stream-multiple data stream (SIMD)
 Multiple instruction stream-single data stream (MISD)
 Multiple instruction stream-multiple data stream (MIMD)

These organizational classes are illustrated by the block diagrams in Fig. 4. The categorization depends on the multiplicity of simultaneous events in the system components. Conceptually, only three types of system components are needed in the illustration. Both instructions and data are fetched from the memory modules. Instructions are decoded by the control unit, which sends the decoded instruction stream to the processor units for execution. Data streams flow between the processors and the memory bidirectionally. Multiple memory modules may be used in the shared memory subsystem. Each instruction stream is generated by an independent control unit. Multiple data streams originate from the subsystem of shared memory modules. I/O facilities are not shown in these simplified block diagrams.

SISD computer organization

This organization, shown in Fig. 4a, represents most serial computers available today. Instructions are executed sequentially but may be overlapped in their execution stages (pipelining). Most SISD uniprocessor systems are pipelined. An SISD computer may have more than one functional unit in it. All the functional units are under the supervision of one control unit.

SIMD computer organization

This class corresponds to array processors. As illustrated in Fig. 4b, there are multiple processing elements (PEs) supervised by the same control unit. All PEs receive the same instruction broadcast from the control unit but operate on different data sets from distinct data streams. The shared memory subsystem may contain multiple modules.

MISD computer organization

This organization is conceptually illustrated in Fig. 4c. There are n processor units, each receiving distinct instructions but operating over the same data stream and its derivatives. The results (output) of one processor become the input (operands) of the next processor in the macropipe. This structure has received much less attention and has been challenged as impractical by some computer architects. No real embodiment of this class exists.

MIMD computer organization

Most multiprocessor systems and multiple computer systems can be classified in this category (Fig. 4d). An intrinsic MIMD computer implies interactions among the n processors because all memory streams are derived from the same data space shared by all processors. If the n data streams were derived from disjoint subspaces of the shared memories, then we would have the so-called multiple SISD (MSISD) operation, which is nothing but a set of n independent SISD uniprocessor systems.

Fig. 4. Flynn's classification of various computer organizations.

Introduction to performance models
Designers evaluate the performance of computers and the effect of design
changes to a computer or subsystem.

In general, there are two common techniques for performing these evaluations: simulation and analytic models. Simulation has a major disadvantage of hiding the effects of workload or architectural parameters that are the input to the simulator. The simulation produces some single measure of performance but the underlying basis for the number is obscure. On the other hand, analytic models must explicitly comprehend each of the workload and architectural parameters.

Because of the short time required to obtain a solution as parameters vary, the effect of these parameters can be evaluated and understood. However, these models do not generally comprehend concurrency and are subject to significant error. [2]

 Computer Performance
For a computer or a computer subsystem, the measures of performance that can be used by a designer when making design choices are generally of two kinds: the time to perform a task and the rate at which given tasks are performed. For example,

Task A executes in 3 µs per task,

Task A executes at a rate of 3.33 x 10^5 executions per second.

Note that for a common workload (Task A), time and rate have a reciprocal relationship. Because a computer has a clock that controls all of its functions, the number of clocks is frequently used as a measure of time.

If the clock period of the processor is 10 ns:

Task A executes in 300 clocks per execution,

Task A executes at a rate of 0.0033 tasks per clock.
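These reciprocal relationships can be checked with a few lines of Python; the sketch below simply re-derives the rate, clock count, and tasks-per-clock figures from the Task A numbers quoted above.

```python
# Reciprocal time/rate relationships for Task A (figures from the example above).
time_per_task = 3e-6      # 3 microseconds per execution of Task A
clock_period = 10e-9      # 10 ns clock period

rate = 1 / time_per_task                        # executions per second
clocks_per_task = time_per_task / clock_period  # clocks per execution
tasks_per_clock = 1 / clocks_per_task

print(f"rate            = {rate:.2e} executions per second")  # ~3.33e+05
print(f"clocks per task = {clocks_per_task:.0f}")              # 300
print(f"tasks per clock = {tasks_per_clock:.4f}")              # ~0.0033
```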

Factors affecting the performance of a computer

1- Clock rate

The CPU of a computer is driven by a clock with a constant frequency f (the clock rate). The clock cycle time is given by

clock cycle time T = 1 / f

The number of clock cycles is frequently used as a measure of time.

Example

Take the synchronizer below. Assume that 15 ns is deemed to be a sufficient clock delay to eliminate most, if not all, metastable conditions on flip-flop 1. What is the minimum clock period and the maximum clock rate of this system?

Solution:

minimum clock period = 15 ns,

maximum clock rate = 1 / 15 ns ≈ 66.7 MHz.

2- Program size (IC)

The number of instructions to be executed in the program is the instruction count (IC). The higher the number of instructions, the higher the speedup factor obtained from pipelining.

3- Clocks per instruction (CPI)

A basic time model that is widely used in evaluating processors is clocks per instruction (CPI), which is the number of clock cycles required to execute an average instruction.

For example, a task that takes 1 x 10^6 clocks to execute 5 x 10^5 instructions has a CPI of 2.0. Small values of CPI indicate higher performance than large values of CPI. For many processors, different instructions require a different number of clocks for execution; thus CPI is an average value for all instructions executed. Further, different programs use instructions in different mixes. CPI combines the weighted use of each instruction class (the fraction of the total instructions) with the number of clocks per instruction class to give a weighted mean.

The reciprocal of CPI is instructions per clock (IPC). IPC is a measure of processing rate and is a useful measure of performance in some situations.

So if we have a program consisting of different instruction types i, we calculate the effective CPI using the following equation:

effective CPI = Σ_i (IC_i × CC_i) / IC

Where

IC_i = the number of instructions of a given instruction type i

CC_i = the clock cycles for that instruction type i

IC = the total instruction count.

Explanation

Let us assume a classic RISC pipeline, with the following 5 stages:

1. Instruction fetch cycle (IF).
2. Instruction decode/Register fetch cycle (ID).
3. Execution/Effective address cycle (EX).
4. Memory access (MEM).
5. Write-back cycle (WB).

Each stage requires one clock cycle, and an instruction passes through the stages sequentially. Without pipelining, a new instruction is fetched in stage 1 only after the previous instruction finishes at stage 5; therefore the number of clock cycles it takes to execute an instruction is 5 (CPI = 5 > 1). In this case, the processor is said to be subscalar. With pipelining, a new instruction is fetched every clock cycle by exploiting instruction-level parallelism; therefore, since one could theoretically have 5 instructions in the 5 pipeline stages at once (one instruction per stage), a different instruction completes stage 5 in every clock cycle, and on average the number of clock cycles it takes to execute an instruction is 1 (CPI = 1). In this case, the processor is said to be scalar.

Example

[3] For the multi-cycle MIPS, there are 5 types of instructions:

 Load (5 cycles)
 Store (4 cycles)
 R-type (4 cycles)
 Branch (3 cycles)
 Jump (3 cycles)

If a program has:

 50% load instructions
 15% R-type instructions
 25% store instructions
 8% branch instructions
 2% jump instructions

then, the CPI is:

CPI = 0.50 × 5 + 0.25 × 4 + 0.15 × 4 + 0.08 × 3 + 0.02 × 3 = 4.40 clocks per instruction
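As a check of the weighted-mean calculation, the short Python sketch below computes the effective CPI for the instruction mix listed above.

```python
# Effective CPI for the multi-cycle MIPS instruction mix given above.
mix = {  # instruction type: (fraction of instructions, cycles per instruction)
    "load":   (0.50, 5),
    "store":  (0.25, 4),
    "r-type": (0.15, 4),
    "branch": (0.08, 3),
    "jump":   (0.02, 3),
}

# The fractions must sum to one, as required for a weighted mean.
assert abs(sum(frac for frac, _ in mix.values()) - 1.0) < 1e-9

effective_cpi = sum(frac * cycles for frac, cycles in mix.values())
print(f"effective CPI = {effective_cpi:.2f}")  # 4.40
```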

4- Millions of instructions per second (MIPS)

Processor speed is often measured in terms of MIPS. MIPS can be useful when comparing performance between processors built with similar architectures (e.g., Microchip-branded microcontrollers), but it is difficult to compare MIPS ratings across differing CPU architectures.

MIPS rate = (number of instructions in the program, in millions) / (time T required to execute the program)

MIPS = instruction count / (execution time × 10^6)

Since execution time = (instruction count × CPI) / clock rate, then

MIPS = clock rate / (CPI × 10^6)

Example
[3] A 400-MHz processor was used to execute a benchmark program with the following instruction mix and clock cycle count:

Determine the effective CPI, MIPS rate, and execution time for this program.

Solution

Total instruction count = 100,000

Therefore

Execution time (T) = CPI × instruction count × clock cycle time
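The instruction-mix table for this benchmark is not reproduced above, so the sketch below uses a hypothetical mix (the per-type counts and cycle figures are assumptions, not the benchmark's actual values) only to show how the effective CPI, MIPS rate, and execution time would be computed for a 400-MHz processor and 100,000 instructions.

```python
# Worked scheme for a 400-MHz processor executing 100,000 instructions.
# NOTE: the per-type counts and cycle counts below are hypothetical stand-ins
# for the benchmark's instruction-mix table, which is not reproduced here.
clock_rate = 400e6  # Hz
mix = {             # type: (instruction count, cycles per instruction)
    "alu":    (45_000, 1),
    "load":   (32_000, 2),
    "branch": (15_000, 2),
    "store":  ( 8_000, 2),
}

ic = sum(count for count, _ in mix.values())                   # total instruction count
total_cycles = sum(count * cpi for count, cpi in mix.values())

effective_cpi = total_cycles / ic
mips = clock_rate / (effective_cpi * 1e6)          # MIPS = clock rate / (CPI x 10^6)
exec_time = ic * effective_cpi / clock_rate        # T = IC x CPI / clock rate

print(f"effective CPI  = {effective_cpi:.2f}")
print(f"MIPS rate      = {mips:.1f}")
print(f"execution time = {exec_time * 1e6:.1f} microseconds")
```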

5- Million floating-point operations per second (MFLOPS)

Sometimes MIPS fails to give a true picture of performance in that it does not track execution time. Another popular alternative measure, millions of floating-point operations per second (MFLOPS), is therefore used. The formula for MFLOPS is simply

MFLOPS = (number of floating-point operations in a program) / (execution time × 10^6)

 Speedup
Designers are faced with the question of evaluating the effect of modifying a design, called design A, into a new design, design B. Is the modified design, design B, better than design A? The answer is found by using the concept of speedup. Note that speedup is a dimensionless ratio:

speedup = (execution time of design A) / (execution time of design B)

If design B is an improvement, the speedup will be greater than 1. If design B hurts performance, the speedup will be less than 1. If the speedup is equal to 1, there is no performance change. This relationship recognizes that reducing CPI is associated with improved performance, because the number of clocks is a measure of time and an instruction is a measure of the work performed.
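A minimal sketch of the speedup comparison, assuming both designs execute the same instruction count at the same clock rate so that CPI alone determines execution time (the CPI figures below are illustrative assumptions):

```python
# Speedup of design B over design A for a common workload.
# Assumption: same instruction count and clock rate, so execution time scales with CPI.
cpi_a = 2.0   # hypothetical CPI of design A
cpi_b = 1.6   # hypothetical CPI of design B

speedup = cpi_a / cpi_b   # time(A) / time(B) for the same work
print(f"speedup = {speedup:.2f}")  # > 1 means design B is an improvement
```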

 Means and Weighted Means

With computers, we can make two types of measurements or observations: (1) the time needed to perform a task, and (2) the rate at which a task is performed.

Two types of means are used to find the central tendency of these measurements: the arithmetic mean for times and the harmonic mean for rates.

The measurements and observations for these means may have equal weights or be weighted.

In many of the subsequent discussions of weighted means, the term weight will be used to refer to the fractional occurrence of an event, in which the fraction is x/100, or to the frequency of an event, in which the frequency is also given as x/100. There will be cases in which the context of the model implies the use of the terms frequency of use or fraction of time rather than weight.

Time-Based Means
Smith (1988) states that "the time required to perform a specific amount of computation is the ultimate measure of computer performance." Thus time measurements are usually the fundamental measurements in the field of computer performance modeling. When other measures are used, their validity can usually be checked by converting them to time.

Arithmetic Mean
The arithmetic mean is used to find the central tendency of equal-weight time measurements. The arithmetic mean of the time per event is determined by

arithmetic mean = (T_1 + T_2 + ... + T_n) / n

An arithmetic mean requires that the data points have equal weights, and the result is frequently called the average. For example, the average grade in a class is the sum of the observed grades divided by the number of students.

Example

A particular job is run on a corporate mainframe on the first of each month. The times required for running this job over a 4-month period are 2 h, 2.2 h, 1.9 h, and 2.3 h. What is the mean or average time to run this job?

Solution

The data points have equal weight because the same job is run each month. Thus the arithmetic mean, or average, is used to find the central tendency. The mean time to run the job over the 4-month period is

(2 + 2.2 + 1.9 + 2.3) / 4 = 2.1 h

Weighted Arithmetic Mean

For many cases, computing an equal-weight arithmetic mean will give misleading results. Care must be taken when events occur at different fractions of the total events and each event requires a different amount of time. The weighted time per event is the weighted arithmetic mean, defined as

weighted arithmetic mean = W_1 T_1 + W_2 T_2 + ... + W_n T_n

The weighted arithmetic mean is the central tendency of time per unit of work. W_i is the fraction that operation i is of the total operations, and T_i is the time consumed by each use. Note that W_1 + W_2 + ... + W_n = 1 and that W_i is not the fraction of time that the operation is in use.

Example

A processor has two classes of instructions: class A instructions take two clocks to execute, whereas class B instructions take three clocks to execute. Of all the instructions executed, 75% are class A instructions and 25% are class B instructions. What is the CPI of this processor?

Solution

The observations are in time and are weighted. Thus the CPI of the processor is determined by the weighted arithmetic mean:

CPI = 0.75 × 2 + 0.25 × 3 = 2.25 clocks per instruction

Note: When solving a problem such as this one, add the event probabilities together and verify that the sum is one; if the sum is not equal to one, there is some error in the solution. A good practice is to use a table, as shown in the table below, for the solution of these problems rather than attempting to bind variables to an equation.
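The tabular weighted-arithmetic-mean calculation suggested in the note can also be written as a short routine; the sketch below reproduces the class A/class B example.

```python
# Weighted arithmetic mean: CPI for the two-class example above.
classes = [  # (weight = fraction of instructions, time = clocks per instruction)
    (0.75, 2),  # class A
    (0.25, 3),  # class B
]

# Verify that the weights (event probabilities) sum to one, as the note advises.
assert abs(sum(w for w, _ in classes) - 1.0) < 1e-9

cpi = sum(w * t for w, t in classes)
print(f"CPI = {cpi:.2f}")  # 2.25
```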

Rate-Based Means
Performance is sometimes measured as rates. For example, a car goes 25 MPH
or a computer performs 100 million instructions per second. Likewise, a
computer may execute 0.5 IPC, the reciprocal of CPI. Thus, instead of time, we
can also consider rates for evaluating performance or design changes. When
rates are the observed events, the harmonic mean and the weighted harmonic
mean will provide the central tendency of the observations.

Harmonic Mean
The harmonic mean is the central tendency of observations expressed as rates with equal weights, and the result is the mean events per unit of time:

harmonic mean = n / (1/R_1 + 1/R_2 + ... + 1/R_n)

The harmonic mean is defined as the reciprocal of the arithmetic mean of the reciprocals of the values, and it is the central tendency of units of work per unit of time.

Example

Consider the example in the subsection on arithmetic means of jobs being run on the corporate computer. We express the observations as a rate measure of jobs per hour. These data are 0.5, 0.45, 0.53, and 0.43 jobs per hour. What is the central tendency of these measurements in jobs per hour?

Solution

The central tendency is found by using the harmonic mean:

harmonic mean = 4 / (1/0.5 + 1/0.45 + 1/0.53 + 1/0.43) ≈ 0.476 jobs per hour

Note that 0.476 jobs per hour is the reciprocal of the 2.1 h per job found previously with the arithmetic mean.
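The harmonic mean of the four job rates can be checked with the sketch below; the small difference from 0.476 comes from the rounding of the rates quoted above.

```python
# Harmonic mean of equal-weight rate observations (jobs per hour).
rates = [0.5, 0.45, 0.53, 0.43]

harmonic_mean = len(rates) / sum(1 / r for r in rates)
print(f"harmonic mean = {harmonic_mean:.3f} jobs per hour")  # ~0.47, the reciprocal of ~2.1 h per job
```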

Weighted Harmonic Mean
When the observed rate data are weighted, the mean becomes

weighted harmonic mean = 1 / (W_1/R_1 + W_2/R_2 + ... + W_n/R_n)

where W_i is the fraction of the total task, not of the time, that is performed at rate R_i. The result is the central tendency in weighted events per unit of time.

Example

A program executes in two modes: 40% of the program instructions execute at the rate of 10 million instructions per second and 60% execute at the rate of 5 million instructions per second. What is the weighted execution rate in millions of instructions per second?

Solution

Because the observations are rates and the executions are weighted, we use the weighted harmonic mean to find the central tendency:

weighted harmonic mean = 1 / (0.4/10 + 0.6/5) = 1 / 0.16 = 6.25 MIPS

Example

Solve the same problem by using times in seconds per instruction, rather than rates.

Solution

Because the data are in time and the executions are weighted, we use the weighted arithmetic mean:

weighted mean time = 0.4 × (1/10) + 0.6 × (1/5) = 0.16 µs per instruction

Note: The two results have a reciprocal relationship, as expected: 1 / (0.16 µs per instruction) = 6.25 MIPS.
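Both solutions of the two-mode example can be checked together; the sketch below computes the weighted harmonic mean of the rates and the weighted arithmetic mean of the corresponding times, and confirms that the two results are reciprocals.

```python
# Two-mode program: 40% of instructions at 10 MIPS, 60% at 5 MIPS.
modes = [(0.4, 10e6), (0.6, 5e6)]  # (fraction of instructions, rate in instructions per second)

weighted_harmonic = 1 / sum(w / r for w, r in modes)      # instructions per second
weighted_mean_time = sum(w * (1 / r) for w, r in modes)   # seconds per instruction

print(f"weighted harmonic mean = {weighted_harmonic / 1e6:.2f} MIPS")           # 6.25
print(f"weighted mean time     = {weighted_mean_time * 1e6:.2f} us per instr")  # 0.16
print(f"reciprocal check       = {1 / weighted_mean_time / 1e6:.2f} MIPS")      # 6.25
```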

 AMDAHL'S LAW
This law models the speedup of a computer when it is processing two classes of tasks: one class can be speeded up whereas the other class cannot. This situation is frequently encountered in computer system design. The solution of Amdahl's law model is normalized to the execution time of the system before any speedup is applied.

Assume that a program has two components, t1 and t2, and that the component t2 can be speeded up by a factor n. The overall speedup of the system is then

speedup = (t1 + t2) / (t1 + t2/n)

And for a fixed load (a given program) the speedup is defined as

speedup s(n) = sequential execution time / execution time (sequential part + parallel part)

s(n) = T_s / (a·T_s + (1 − a)·T_s / n)

Where

T_s = execution time for sequential processing of the whole task on one processor

a = fraction of the execution time spent in the sequential part of the program

n = number of parallel processors

We can rewrite the above equation as

s(n) = 1 / (a + (1 − a)/n)

Multiplying numerator and denominator by n, we get

s(n) = n / (1 + (n − 1)·a)

We consider two limits.

As a → 0 (in other words, the complete program can be speeded up and there is no sequential load), the speedup s(n) → n.

As n → ∞ (in other words, t2 is reduced to zero), the speedup s(n) → 1/a.

This second limit tells us that if 10% of the time of the original system cannot be speeded up, then the greatest speedup possible for the system is 10.

The best speedup is bounded above by 1/a, regardless of how many processors are employed, because the sequential portion of the program does not change with the number of processors in the computer.

Example

An executing program is timed, and it is found that the serial portion (the portion that cannot be speeded up) consumes 30 s, whereas the other portion, which can be speeded up, consumes 70 s. You believe that by using parallel processors you can speed up this latter portion by a factor of 8. What is the speedup of the system?

Solution

A graphical solution to this Amdahl's law problem is shown in the figure below. The 70-s portion of the time is decreased by a factor of 8 to 8.75 s. The speedup of the system is found by dividing the original time (100 s) by the enhanced time (30 + 8.75 = 38.75 s). Thus the speedup is 2.58.

The same result can be found by binding the arguments to Amdahl's law. For the original system, a = 0.3 and the enhancement is n = 8. Thus the speedup is calculated as

s(8) = 1 / (0.3 + 0.7/8) = 1 / 0.3875 ≈ 2.58
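The same Amdahl's law calculation can be scripted; the sketch below reproduces the 30 s / 70 s example and also evaluates the 1/a limit on the best possible speedup.

```python
# Amdahl's law: s(n) = 1 / (a + (1 - a) / n)
def amdahl_speedup(a: float, n: float) -> float:
    """a = sequential fraction of execution time, n = speedup factor of the parallel part."""
    return 1.0 / (a + (1.0 - a) / n)

a = 30 / (30 + 70)  # sequential fraction from the example (0.3)

print(f"s(8)        = {amdahl_speedup(a, 8):.2f}")  # ~2.58
print(f"upper bound = {1 / a:.2f}")                 # 3.33, the limit as n -> infinity
```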

 Relative vector/scalar performance
Let W = total amount of work done in a computer, of which:

W_S = work done in scalar execution (on one processor)

W_V = work done in vector execution (in parallel, on many processors)

W = W_S + W_V

Let R_S and R_V be the scalar and vector execution rates, respectively, in MIPS or MFLOPS.

Let α be the vectorization ratio (α is the percentage of the code executed in parallel).


Then α = W_V / (W_S + W_V)

Let r be the vector/scalar speed ratio; then r = R_V / R_S

If only scalar execution is used (vector mode not used at all), the scalar (sequential) execution time is

T_s = (W_S + W_V) / R_S

With combined vector and scalar execution (sequential + parallel) we get

T_com = W_S / R_S + W_V / R_V

Dividing T_s by T_com we get the relative speedup

S = T_s / T_com = [(W_S + W_V) / R_S] / [W_S / R_S + W_V / R_V] = r (W_S + W_V) / (r W_S + W_V)

Now, substituting the vectorization ratio α, we get

S = r / (r (1 − α) + α) = 1 / ((1 − α) + α / r)

This is Amdahl's law in another form: replace (1 − α) with a and r with n.
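A small sketch of the relative vector/scalar speedup S = r / (r(1 − α) + α); the vectorization ratio and speed ratio used below are illustrative assumptions, not values from the text.

```python
# Relative vector/scalar performance: S = r / (r * (1 - alpha) + alpha)
def relative_speedup(alpha: float, r: float) -> float:
    """alpha = vectorization ratio (fraction of the work done in vector mode),
    r = vector/scalar speed ratio Rv / Rs."""
    return r / (r * (1.0 - alpha) + alpha)

# Hypothetical values: 90% of the work vectorized, vector unit 10x faster than scalar.
print(f"S = {relative_speedup(0.9, 10):.2f}")  # ~5.26
```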

 Interleaved Memory Organization
In order to close up the speed gap between the CPU/cache and the main memory built with RAM modules, an interleaving technique is presented below which allows pipelined access of the parallel memory modules.

The memory design goal is to broaden the effective memory bandwidth so that more memory words can be accessed per unit time. The ultimate purpose is to match the memory bandwidth with the bus bandwidth and with the processor bandwidth. [4]

Memory modules

The main memory is built with multiple modules. These memory modules are connected to a system bus or a switching network to which other resources such as processors or I/O devices are also connected.

Once presented with a memory address, each memory module returns one word per cycle. It is possible to present different addresses to different memory modules so that parallel access of multiple words can be done simultaneously or in a pipelined fashion. Both parallel access and pipelined access are forms of parallelism practiced in a parallel memory organization.

Consider a main memory formed with m = 2^a memory modules ("a" is the number of bits needed to select a module), each containing w = 2^b words of memory cells ("b" is the number of bits required to address a word within the module, the "offset").

The total memory capacity is m · w = 2^(a+b) words. These memory words are assigned linear addresses.

Different ways of assigning linear addresses result in different memory organizations.

Memory interleaving

Memory interleaving is a technique used to increase memory throughput. The core idea is to split the memory system into independent banks, which can answer read or write requests independently and in parallel.

There are two address formats for memory interleaving: high-order interleaving and low-order interleaving.

Low-order interleaving spreads contiguous memory locations across the m modules horizontally (Fig. 5a). This implies that the low-order a bits of the memory address are used to identify the memory module. The high-order b bits are the word addresses (displacement) within each module. Note that the same word address is applied to all memory modules simultaneously. A module address decoder is used to distribute module addresses.

High-order interleaving (Fig. 5b) uses the high-order a bits as the module address and the low-order b bits as the word address within each module. Contiguous memory locations are thus assigned to the same memory module. In each memory cycle, only one word is accessed from each module. Thus high-order interleaving cannot support block access of contiguous locations.

Fig. 5. Two interleaved memory organizations with m = 2^a modules and w = 2^b words per module (word addresses shown in boxes).

On the other hand, low-order m-way interleaving does support block access in a pipelined fashion. Unless otherwise specified, we consider only low-order memory interleaving in the subsequent discussion.
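The two address decompositions can be illustrated in a few lines; the sketch below splits a linear address into a module number and a word (offset) address under both low-order and high-order interleaving, assuming m = 2^a modules of w = 2^b words each.

```python
# Address decomposition for an interleaved memory with m = 2**a modules
# of w = 2**b words each (linear addresses run from 0 to m*w - 1).
def low_order(addr: int, a: int) -> tuple[int, int]:
    """Low-order interleaving: low a bits select the module, remaining bits the word."""
    return addr & ((1 << a) - 1), addr >> a

def high_order(addr: int, b: int) -> tuple[int, int]:
    """High-order interleaving: high bits select the module, low b bits the word."""
    return addr >> b, addr & ((1 << b) - 1)

a, b = 3, 3  # 8 modules of 8 words each (a = b = 3)
for addr in (0, 1, 2, 9):
    print(addr, "low-order:", low_order(addr, a), "high-order:", high_order(addr, b))
# Consecutive addresses fall in different modules under low-order interleaving
# (supporting block access), but in the same module under high-order interleaving.
```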

 S-Access memory Organization
The low-order interleaved memory can be rearranged to allow simultaneous access, or S-access, as illustrated in Fig. 6a. In this case, all memory modules are accessed simultaneously in a synchronized manner. Again, the high-order (n − a) bits select the same offset word from each module.

At the end of each memory cycle (Fig. 6), m = 2^a consecutive words are latched in the data buffers simultaneously. The low-order a bits are then used to multiplex the m words out, one word per minor cycle.

If the minor cycle is chosen to be 1/m of the major memory cycle, then it takes two memory cycles to access m consecutive words.

However, if the access phase of the last access is overlapped with the fetch phase of the current access (Fig. 6b), effectively the m words take only one memory cycle to access. If the stride is greater than 1, then the throughput decreases, roughly in proportion to the stride.

Fig. 6 The S-access interleaved memory for vector operand access

 Pipelined Memory Access

Access of the m memory modules can be overlapped in a pipelined fashion. For this purpose, the memory cycle (called the major cycle) is subdivided into m minor cycles.

Fig. 7 Multiway interleaved memory organization and the C-access timing
chart

An 8-way interleaved memory (with m = 8 and w = 8, and thus a = b = 3) is shown in Fig. 7. Let θ be the major cycle and t be the minor cycle. These two cycle times are related as follows:

t = θ / m

where m is the degree of interleaving. The timing of the pipelined access of the eight contiguous memory words is shown in Fig. 7. This type of concurrent access of contiguous words has been called a C-access memory scheme. The major cycle θ is the total time required to complete the access of a single word from a module. The minor cycle t is the actual time needed to produce one word, with accesses to successive memory modules overlapped and separated by one minor cycle t.

Note that the pipelined access of the block of eight contiguous words is sandwiched between other pipelined block accesses before and after the present block. Even though the total block access time is 2θ, the effective access time of each word is reduced to t as the memory is contiguously accessed in a pipelined fashion.
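Under the C-access timing just described, a block of consecutive words costs roughly one major cycle of startup plus one minor cycle per additional word; the sketch below estimates that time and the effective per-word access time (the θ value is an assumed figure for illustration).

```python
# Estimated time to fetch n consecutive words from an m-way interleaved memory
# with major cycle theta, using pipelined (C-access) block access.
def block_access_time(n: int, m: int, theta: float) -> float:
    t = theta / m                # minor cycle
    return theta + (n - 1) * t   # first word after theta, then one word per minor cycle

theta = 100e-9  # assumed 100 ns major memory cycle
m = 8           # 8-way interleaving, as in Fig. 7

for n in (1, 8, 64):
    total = block_access_time(n, m, theta)
    print(f"n = {n:3d}: total {total * 1e9:6.1f} ns, effective {total / n * 1e9:5.1f} ns per word")
# As n grows, the effective per-word time approaches the minor cycle theta / m.
```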

 Bandwidth and Fault Tolerance

Hellerman (1967) has derived an equation to estimate the effective increase in memory bandwidth through multiway interleaving. A single memory module is assumed to deliver one word per memory cycle and thus has a bandwidth of 1.

Memory Bandwidth
The memory bandwidth B of an m-way interleaved memory is upper-bounded by m and lower-bounded by 1. The Hellerman estimate of B is

B ≈ m^0.56 ≈ √m

where m is the number of interleaved memory modules. This equation implies that if 16 memory modules are used, the effective memory bandwidth is approximately four times that of a single module.

This pessimistic estimate is due to the fact that block access of various lengths
and access of single words are randomly mixed in user programs.
Hellerman's estimate was based on a single-processor system.
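A quick evaluation of the Hellerman estimate for a few module counts, assuming the B ≈ m^0.56 form given above:

```python
# Hellerman's estimate of effective bandwidth for m-way interleaved memory.
for m in (4, 8, 16, 32):
    bandwidth = m ** 0.56  # words per memory cycle, relative to one module
    print(f"m = {m:2d}: B ~ {bandwidth:.1f}")
# For m = 16 this gives roughly 4.7, i.e. about four times a single module's bandwidth.
```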

If memory-access conflicts from multiple processors (such as the hot-spot problem) are considered, the effective memory bandwidth will be further reduced.

In a vector processing computer, the access time of a long vector with n elements and stride distance 1 has been estimated by Cragon (1992) as follows. It is assumed that the n elements are stored in contiguous memory locations in an m-way interleaved memory system. The average time t_1 required to access one element of the vector is estimated by

t_1 = (θ / m) (1 + (m − 1) / n)

When n → ∞ (a very long vector), t_1 → θ/m = t. As n → 1 (scalar access), t_1 → θ.

The above equation conveys the message that interleaved memory is well suited to pipelined access of long vectors; the longer the vector, the better.

Fault Tolerance

High-order and low-order interleaving can be combined to yield many different interleaved memory organizations. In a high-order interleaved memory, sequential addresses are assigned within each memory module.

This makes it easier to isolate faulty memory modules in a memory bank of m memory modules. When one module failure is detected, the remaining modules can still be used by opening a window in the address space. This fault isolation cannot be carried out in a low-order interleaved memory, in which a module failure may paralyze the entire memory bank. Thus a purely low-order interleaved memory is not fault-tolerant.

Example: Memory banks, fault tolerance, and bandwidth trade-offs

In Fig. 8, two alternative memory addressing schemes are shown which combine the high-order and low-order interleaving concepts. These alternatives offer better bandwidth in case of module failure. A four-way low-order interleaving is organized in each of two memory banks in Fig. 8a.

Fig. 8 Bandwidth analysis of two interleaved memory organizations over eight memory modules (absolute addresses shown in each memory bank).

On the other hand, two-way low-order interleaving is depicted in Fig. 8b, with the memory system divided into four memory banks. The high-order bits are used to identify the memory banks. The low-order bits are used to address the modules within each bank for memory interleaving.

In case of a single module failure, the maximum memory bandwidth of the purely eight-way interleaved memory is reduced to zero because the entire memory bank must be abandoned. For the four-way, two-bank design (Fig. 8a), the maximum bandwidth is reduced to four words per memory cycle because only one of the two banks (the one containing the faulty module) is abandoned.
In the two-way design of Fig. 8b, the gracefully degraded memory system may still have three working memory banks; thus a maximum bandwidth of six words per memory cycle is expected. The higher the degree of interleaving, the higher the potential memory bandwidth if the system is fault-free.

References
[1] Kai Hwang and Faye A. Briggs, Computer Architecture and Parallel Processing, McGraw-Hill (1984).

[2] Harvey G. Cragon, Computer Architecture and Implementation, Cambridge University Press (2000).

[3] https://en.wikipedia.org/wiki/Cycles_per_instruction

[4] Kai Hwang and Naresh Jotwani, Advanced Computer Architecture: Parallelism, Scalability, Programmability, McGraw-Hill Education (2008).

