Computer Performance
UNIVERSITY OF BAHRI
COLLEGE OF ENGINEERING AND ARCHITECTURE
Presented by:
Group A
Supervisor:
Parallel processing
Parallel processing has emerged as a key technology in modern computers,
driven by the increasing demand for higher performance, lower cost, and
higher productivity in real-life applications.
Concurrent events take place in today's high-performance computers due
to the common practice of multiprogramming, multiprocessing, or
multicomputing:
Batch processing
Multiprogramming
Time sharing
Multiprocessing
Definition
Parallel processing can be performed at four programmatic levels:
Job or program level
The highest level of parallel processing is conducted among multiple jobs or
programs.
Task or procedure level
This is the next highest level of parallel processing and is conducted among
procedures or tasks (program segments) within the same program. It involves
the decomposition of a program into multiple tasks.
Instruction level
The third level exploits concurrency among multiple instructions.
Intrainstruction level
Finally, we may wish to have faster and concurrent operations within each
instruction.
Parallelism in a Uniprocessor System
Most general-purpose uniprocessor systems have the same basic structure.
Fig. 1 shows the architectural components of the superminicomputer VAX-
11/780, manufactured by Digital Equipment Corporation. The CPU contains
the master controller of the VAX system. There are sixteen 32-bit general-
purpose registers, one of which serves as the program counter (PC). There is
also a special CPU status register containing information about the current
state of the processor and of the program being executed. The CPU contains
an arithmetic and logic unit (ALU) with an optional floating-point accelerator,
and some local cache memory with an optional diagnostic memory. The
operator can intervene in the CPU through the console, which is connected to
a floppy disk.
The CPU, the main memory (2^32 words of 32 bits each), and the I/O
subsystems are all connected to a common bus, the synchronous backplane
interconnect (SBI). Through this bus, all I/O devices can communicate with
each other, with the CPU, or with the memory. Peripheral storage or I/O
devices can be connected directly to the SBI through the unibus and its
controller (which can be connected to PDP-11 series minicomputers), or
through a massbus and its controller.
1- Multiplicity of functional units
Early computers had only one arithmetic and logic unit in the CPU.
Furthermore, the ALU could perform only one function at a time, a rather
slow process for executing a long sequence of arithmetic and logic
instructions. In practice, many of the functions of the ALU can be distributed
to multiple specialized functional units which operate in parallel. The
CDC-6600 (designed in 1964) has 10 functional units built into its CPU
(Fig. 2). These 10 units are independent of each other and may operate
simultaneously. A scoreboard is used to keep track of the availability of the
functional units and the registers being demanded. With 10 functional units
and 24 registers available, the instruction issue rate can be significantly
increased.
Almost all modern computers and attached processors are equipped with
multiple functional units to perform parallel or simultaneous arithmetic and
logic operations. This practice of functional specialization and distribution
can be extended to array processors and multiprocessors.
2- Parallelism and pipelining within the CPU
Parallel adders, using such techniques as carry-lookahead and carry-save, are
now built into almost all ALUs. This is in contrast to the bit-serial adders
used in first-generation machines. High-speed multiplier recoding and
convergence division are techniques for exploiting parallelism and the
sharing of hardware resources for the functions of multiply and divide. The
use of multiple functional units is a form of parallelism within the CPU.
3- Use of a hierarchical memory system
Usually, the CPU is about 1000 times faster than memory access. A
hierarchical memory system can be used to close up the speed gap. Computer
memory hierarchy is conceptually illustrated in Fig. 3. The innermost level is
the register file, directly addressable by the ALU. Cache memory can be used
to serve as a buffer between the CPU and the main memory. Block access of
the main memory can be achieved through multiway interleaving across
parallel memory modules. Virtual memory space can be established with the
use of disks and tape units at the outer levels.
Even when there is only one CPU in a uniprocessor system, we can still
achieve a high degree of resource sharing among many user programs. We
will briefly review the concepts of multiprogramming and time sharing in this
subsection. These are software approaches to achieve concurrency in a
uniprocessor system.
Classification of computers by parallelism
In this section, computers are classified into four architectural categories
based on the multiplicity of instruction streams and data streams in a
computer system (Flynn's classification).
The SIMD (single instruction stream, multiple data stream) class corresponds
to array processors. As illustrated in Fig. 4b, there are multiple processing
elements (PEs) supervised by the same control unit. All PEs receive the same
instruction broadcast from the control unit but operate on different data sets
from distinct data streams. The shared memory subsystem may contain
multiple modules.
Fig. 4 Flynn's classification of various computer organizations.
Introduction to performance models
Designers evaluate the performance of computers and the effect of design
changes to a computer or subsystem.
Computer Performance
For a computer or a computer subsystem, when it comes to measures of
performance that can be used by a designer in making design choices, we are
generally interested in two things: the time to perform given tasks and the
rate at which given tasks are performed.
Note that for a common workload (Task A), time and rate have a reciprocal
relationship; for example, a task completed in 2 s is performed at a rate of
0.5 tasks per second. Because a computer has a clock that controls all of its
functions, the number of clocks is frequently used as a measure of time.
Factors affecting the performance of a computer
1- Clock rate
The clock rate determines the duration of one clock cycle; for a given number
of clocks, execution time is inversely proportional to the clock rate.
2- Program size (instruction count, IC)
A basic time model that is widely used in evaluating processors is clocks per
instruction (CPI), which is the number of clock cycles required to execute an
average instruction.
For example, a task that takes 1 × 10^6 clocks to execute 5 × 10^5 instructions
has a CPI of 2.0. Small values of CPI indicate higher performance than large
values of CPI. For many processors, different instructions require a different
number of clocks for execution; thus CPI is an average value for all
instructions executed. Further, different programs use instructions in
different mixes. CPI combines the weighted use of each instruction class (the
fraction of the total instructions) with the number of clocks per instruction
class to give a weighted mean:

CPI = Σ_i (W_i × CPI_i)

where W_i is the fraction of the total instructions that belong to class i and
CPI_i is the number of clocks per instruction for class i.
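As a quick check of these definitions, here is a minimal Python sketch
(function names and the sample mix are ours, not from the slides) that
computes a simple CPI and a weighted-mean CPI:

# Minimal sketch of the CPI definitions above (names are ours).

def cpi(total_clocks, instruction_count):
    """CPI = total clock cycles / instructions executed."""
    return total_clocks / instruction_count

def weighted_cpi(mix):
    """Weighted-mean CPI = sum of W_i * CPI_i over instruction classes.
    mix maps class name -> (fraction of instructions W_i, clocks CPI_i)."""
    assert abs(sum(w for w, _ in mix.values()) - 1.0) < 1e-9
    return sum(w * c for w, c in mix.values())

# Worked example from the text: 1e6 clocks for 5e5 instructions -> CPI = 2.0
print(cpi(1e6, 5e5))  # 2.0

mix = {"alu": (0.5, 1), "mem": (0.3, 3), "branch": (0.2, 2)}  # HYPOTHETICAL mix
print(weighted_cpi(mix))  # 1.8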
Explanation
Consider a processor with a five-stage instruction pipeline. Each stage
requires one clock cycle, and an instruction passes through the stages
sequentially. Without pipelining, a new instruction is fetched in stage 1 only
after the previous instruction finishes at stage 5; therefore the number of
clock cycles it takes to execute an instruction is five (CPI = 5 > 1), and the
processor is said to be subscalar. With pipelining, a new instruction is fetched
every clock cycle by exploiting instruction-level parallelism. Since one could
theoretically have five instructions in the five pipeline stages at once (one
instruction per stage), a different instruction completes stage 5 in every clock
cycle, and on average the number of clock cycles it takes to execute an
instruction is 1 (CPI = 1). In this case, the processor is said to be scalar.
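The arithmetic behind CPI = 5 versus CPI = 1 can be sketched as follows,
under the usual idealized assumption of no stalls or hazards:

# Idealized k-stage pipeline timing (a sketch; assumes no stalls or hazards).

def pipeline_cycles(n_instructions, n_stages):
    """Cycles to run n instructions with pipelining: fill + drain."""
    return n_stages + (n_instructions - 1)

def effective_cpi(n_instructions, n_stages):
    return pipeline_cycles(n_instructions, n_stages) / n_instructions

print(effective_cpi(5, 5))       # 1.8 for a very short run
print(effective_cpi(100000, 5))  # ~1.0: CPI approaches 1 once the pipeline stays full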
Example
Suppose the instruction classes of a processor require the following cycle
counts:
Load (5 cycles)
Store (4 cycles)
R-type (4 cycles)
Branch (3 cycles)
Jump (3 cycles)
If a program has a given mix of these instruction classes, the effective CPI is
the weighted mean of these cycle counts, as in the sketch below.
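The slide's actual instruction mix did not survive extraction, so the following
sketch uses a made-up mix (the percentages are our assumption, purely for
illustration) together with the cycle counts listed above:

# Weighted-mean CPI for the listed cycle counts.
# The instruction-mix fractions below are HYPOTHETICAL (the slide's mix was lost).

cycles = {"load": 5, "store": 4, "r-type": 4, "branch": 3, "jump": 3}
mix = {"load": 0.30, "store": 0.10, "r-type": 0.40, "branch": 0.15, "jump": 0.05}

assert abs(sum(mix.values()) - 1.0) < 1e-9
cpi = sum(mix[k] * cycles[k] for k in cycles)
print(cpi)  # 4.1 with this assumed mix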
Processor speed is often measured in terms of MIPS (millions of instructions
per second). MIPS can be useful when comparing performance between
processors made with similar architectures (e.g., Microchip-branded
microcontrollers), but MIPS figures are difficult to compare between differing
CPU architectures. MIPS is defined as
MIPS = instruction count / (execution time × 10^6)

Since execution time = (instruction count × CPI) / clock rate, then

MIPS = clock rate / (CPI × 10^6)
Example
[3] A 400-MHz processor was used to execute a benchmark program with a
given instruction mix and clock cycle count per instruction class. Determine
the effective CPI, MIPS rate, and execution time for this program.
Solution
The effective CPI is the weighted arithmetic mean over the instruction
classes; the MIPS rate and execution time then follow from the formulas
above, as in the sketch below.
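Since the benchmark's instruction-mix table is not given here, the following
sketch uses made-up numbers (the mix and per-class CPIs are our
assumptions) purely to demonstrate the method:

# Effective CPI, MIPS, and execution time for a 400-MHz processor.
# The instruction mix and per-class CPIs below are HYPOTHETICAL (the original
# benchmark table was lost); only the method is taken from the text.

clock_rate = 400e6  # 400 MHz

# class: (instruction count, clocks per instruction)
benchmark = {"alu": (45000, 1), "load/store": (32000, 2), "branch": (8000, 2)}

ic = sum(n for n, _ in benchmark.values())          # total instructions
clocks = sum(n * c for n, c in benchmark.values())  # total cycles
cpi = clocks / ic                                   # effective CPI
mips = clock_rate / (cpi * 1e6)                     # MIPS rate
exec_time = ic * cpi / clock_rate                   # seconds

print(f"CPI = {cpi:.2f}, MIPS = {mips:.1f}, time = {exec_time * 1e6:.1f} us")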
Sometimes MIPS can fail to give a true picture of performance in that it does
not track execution time. So another popular measure, millions of floating-
point operations per second (MFLOPS), is used. The formula for MFLOPS is
simply

MFLOPS = number of floating-point operations / (execution time × 10^6)
Speedup
Designers are faced with the question of evaluating the effect of modifying a
design, called design A, into design B. Is the modified design, design B, better
than design A? The answer is found by using the concept of speedup. Note
that speedup is a dimensionless ratio:

speedup = time(design A) / time(design B)

There are three types of means used to find the central tendency of
measurements: the arithmetic mean for times, the harmonic mean for rates,
and the geometric mean for ratios. The measurements and observations for
these means may have equal weights or be weighted.
Time-Based Means
Smith (Smith 1988) states that "the time required to perform a specific amount
of computation is the ultimate measure of computer performance." Thus time
measurements are usually fundamental measurements in the field of
computer performance modeling. When other measures are used, the validity
of these measures can usually be checked by converting them to time.
Arithmetic Mean
The arithmetic mean is used to find the central tendency of equal-weight time
measurements. The arithmetic mean of the time per event is determined by

T̄ = (1/n) × Σ_{i=1..n} T_i

An arithmetic mean requires that the data points have equal weights, and the
result is frequently called the average. For example, the average grade, x̄, in a
class is the sum of the observed grades divided by the number of students.
Example
A given job is run on the corporate computer once a month; the measured
run times for four consecutive months are 2.0, 2.2, 1.9, and 2.3 hours. What is
the mean time to run the job?
Solution
The data points have equal weight, as the same job is run each month. Thus
the arithmetic mean or average is used to find the central tendency. The
mean or average time to run a job over the 4-month period is

T̄ = (2.0 + 2.2 + 1.9 + 2.3) / 4 = 2.1 h per job
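A one-line check of this average in code (the times are those reconstructed
from the jobs-per-hour rates quoted later in these notes):

# Arithmetic mean of equal-weight time measurements (hours per job).
times = [2.0, 2.2, 1.9, 2.3]
mean_time = sum(times) / len(times)
print(mean_time)  # 2.1 hours per job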
Weighted Arithmetic Mean
The weighted arithmetic mean is the central tendency of time per unit of
work:

T̄_w = Σ_{i=1..n} W_i × T_i

where W_i is the fraction that operation i is of the total operations and T_i is
the time consumed by each use. Note that W_1 + W_2 + … + W_n = 1 and
that W_i is not the fraction of time that the operation is in use.
Example
Solution
The observations are in time and are weighted. Thus the CPI of the processor
is determined by the weighted arithmetic mean:
Note: when solving a problem such as this one, add the event probabilities
together and verify that the sum is one; if the sum is not equal to one, there is
some error in the solution. A good practice is to use a table, as shown in the
table below, for the solution of these problems rather than attempting to bind
variables to an equation.
Rate-Based Means
Performance is sometimes measured as rates. For example, a car goes 25 MPH
or a computer performs 100 million instructions per second. Likewise, a
computer may execute 0.5 IPC, the reciprocal of CPI. Thus, instead of time, we
can also consider rates for evaluating performance or design changes. When
rates are the observed events, the harmonic mean and the weighted harmonic
mean will provide the central tendency of the observations.
Harmonic Mean
The harmonic mean is the central tendency of observations expressed as rates
having equal weights, and the result is the mean events per unit of time:

H = n / Σ_{i=1..n} (1/R_i)

That is, the harmonic mean is the reciprocal of the arithmetic mean of the
reciprocals of the values, and it is the central tendency of units of work per
unit of time.
Example
Consider the example in the subsection on arithmetic means of jobs being run
on the corporate computer. We express the observations in a rate measure of
jobs per hour. These data are 0.5, 0.45, 0.53, and 0.43 jobs per hour. What is
the central tendency of these measurements in jobs per hour?
Solution

H = 4 / (1/0.5 + 1/0.45 + 1/0.53 + 1/0.43) ≈ 0.47 jobs per hour

(0.476 before the rates were rounded to two digits). Note that 0.476 jobs per
hour is the reciprocal of the 2.1 h per job found previously with the
arithmetic mean.
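The same check in code (the small difference from 0.476 comes from the
quoted rates being rounded):

# Harmonic mean of equal-weight rates (jobs per hour).
rates = [0.5, 0.45, 0.53, 0.43]
h = len(rates) / sum(1.0 / r for r in rates)
print(h)  # ~0.474; ~0.476 with unrounded rates, the reciprocal of ~2.1 h/job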
Weighted Harmonic Mean
When the observed rate data are weighted, the mean becomes

R̄ = 1 / Σ_{i=1..n} (W_i / R_i)

where W_i is the fraction of the total task, not time, that is performed at rate
R_i. The result is the central tendency in weighted events per unit of time.
Example
Solution
Because the observations are rates and the executions are weighted, we use
the weighted harmonic mean to find the central tendency, as shown in the
table below.
Example
Solve the same problem by using times in seconds per instruction, rather than
rates.
Solution
Because the data are in time and the executions are weighted, we use the
weighted arithmetic mean, as shown in the table below.
Note: the two results have a reciprocal relationship, as expected.
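A sketch of both weighted means side by side (the weights and rates here are
our own illustrative numbers, since the slide's table was lost), showing the
reciprocal relationship:

# Weighted harmonic mean of rates vs. weighted arithmetic mean of times.
# The weights and rates are HYPOTHETICAL illustration values.

weights = [0.6, 0.3, 0.1]  # fractions of the total work, sum to 1
rates = [2.0, 1.0, 0.5]    # events per unit time for each part

whm = 1.0 / sum(w / r for w, r in zip(weights, rates))  # events per unit time
times = [1.0 / r for r in rates]                        # time per event
wam = sum(w * t for w, t in zip(weights, times))        # time per event

print(whm, wam, whm * wam)  # the product is 1.0: reciprocal relationship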
AMDAHL'S LAW
This law models the speedup of a computer when it is processing two classes
of tasks: one class can be speeded up, whereas the other class cannot. This
situation is frequently encountered in computer system design. The solution
of Amdahl's law model is normalized to the execution time of the system
before any speedup is applied.
Assume that a program has two components, t1 and t2, of which only t2 can
be speeded up. The overall speedup of the system is

s(n) = T_s / (a × T_s + (1 − a) × T_s / n)

where
T_s = execution time for sequential processing of the whole task on one
processor,
a = the fraction of the task that cannot be speeded up, and
n = the factor by which the remaining fraction (1 − a) is speeded up.

Cancelling T_s gives

s(n) = 1 / (a + (1 − a)/n)
We consider two limits of this expression.
As a → 0 (in other words, the complete program can be speeded up and there
is no sequential load), the speedup s(n) → n.
As n → ∞, the speedup s(n) → 1/a. This second limit tells us that if 10% of
the time of the original system cannot be speeded up, then the greatest
speedup possible for the system is 10.
Example
An executing program is timed, and it is found that the serial portion (the
portion that cannot be speeded up) consumes 30 s, whereas the other portion,
which can be speeded up, consumes 70 s. You believe that by using parallel
processors you can speed up this latter portion by a factor of 8. What is the
speedup of the system?
Solution
A graphical solution to this Amdahl's law problem is shown in the figure
below. The 70-s portion of the time is decreased by a factor of 8 to 8.75 s. The
speedup of the system is found by dividing the original time (100 s) by the
enhanced time (30 + 8.75 = 38.75 s). Thus the speedup is 2.58.
The same result can be found by binding the arguments to Amdahl's law: for
the original system a = 0.3 and the enhancement is n = 8, so the speedup is
calculated as

s(8) = 1 / (0.3 + 0.7/8) = 1 / 0.3875 ≈ 2.58
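The same computation as a small reusable sketch:

# Amdahl's law: s(n) = 1 / (a + (1 - a)/n),
# where a is the fraction that cannot be speeded up.

def amdahl_speedup(a, n):
    return 1.0 / (a + (1.0 - a) / n)

print(amdahl_speedup(0.3, 8))    # ~2.58, the worked example above
print(amdahl_speedup(0.1, 1e9))  # ~10: the n -> infinity limit for a = 0.1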
Relative vector/scalar performance
Let W be the total amount of work done in a computer, of which

W = W_s + W_v

where W_s is the work executed in scalar mode and W_v is the work that can
be executed in vector mode. Let R_s and R_v be the scalar and vector
execution rates, respectively, and let r = R_v / R_s be the vector/scalar speed
ratio.
If only scalar execution is used (vector mode not used at all), then the
execution time is

T_s = (W_s + W_v) / R_s

If vector mode is used for W_v, the execution time becomes

T = W_s / R_s + W_v / R_v

so the relative vector/scalar performance (speedup) is

T_s / T = (W_s + W_v) / (W_s + W_v / r)
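A sketch of this ratio in code (the work split and speed ratio below are
hypothetical illustration values, not from the slides):

# Relative vector/scalar performance: (Ws + Wv) / (Ws + Wv / r).
# The work split and speed ratio below are HYPOTHETICAL illustration values.

def vector_speedup(ws, wv, r):
    return (ws + wv) / (ws + wv / r)

# e.g. 20% scalar work, 80% vectorizable work, vector unit 10x the scalar rate:
print(vector_speedup(0.2, 0.8, 10.0))  # ~3.57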
Interleaved Memory Organization
In order to close up the speed gap between the CPU/cache and the main
memory built with RAM modules, an interleaving technique is presented
below which allows pipelined access of the parallel memory modules.
Memory modules
The main memory is built with multiple modules. These memory modules
are connected to a system bus or a switching network to which other
resources such as processors or I/O devices are also connected.
Once presented with a memory address, each memory module returns one
word per cycle. It is possible to present different addresses to different
memory modules so that parallel access of multiple words can be done
simultaneously or in a pipelined fashion. Both parallel access and pipelined
access are forms of parallelism practiced in a parallel memory organization.
Consider a main memory built with m = 2^a modules, each containing
w = 2^b words. The total memory capacity is m × w = 2^(a+b) words. These
memory words are assigned linear addresses.
Memory interleaving
There are two address formats for memory interleaving: high-order
interleaving and low-order interleaving.
High-order interleaving (Fig. 5b) uses the high-order a bits as the module
address and the low-order b bits as the word address within each module.
Contiguous memory locations are thus assigned to the same memory
module. In each memory cycle, only one word is accessed from each module.
Thus high-order interleaving cannot support block access of contiguous
locations.
Fig. 5 Two interleaved memory organizations with m = 2^a modules.
On the other hand, low-order m-way interleaving (Fig. 5a) uses the low-order
a bits as the module address, spreading contiguous memory locations across
successive modules, and therefore does support block access in a pipelined
fashion (the sketch below contrasts the two mappings).
Unless otherwise specified, we consider only low-order memory interleaving
in subsequent discussions.
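A small sketch of the two address mappings, assuming m = 2^a modules of
w = 2^b words (the function names are ours):

# Address decomposition for high-order vs. low-order interleaving,
# with m = 2**a modules of w = 2**b words each.

def high_order(addr, a, b):
    """Returns (module, word offset): high-order a bits pick the module."""
    return addr >> b, addr & ((1 << b) - 1)

def low_order(addr, a, b):
    """Returns (module, word offset): low-order a bits pick the module."""
    return addr & ((1 << a) - 1), addr >> a

# Eight contiguous addresses with m = 8 (a = 3), w = 16 (b = 4):
for addr in range(8):
    print(addr, high_order(addr, 3, 4), low_order(addr, 3, 4))
# High-order: all eight land in module 0 (no block access).
# Low-order: one word per module (pipelined block access possible).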
S-Access Memory Organization
The low-order interleaved memory can be rearranged to allow simultaneous
access, or S-access, as illustrated in Fig. 6a. In this case, all memory modules
are accessed simultaneously in a synchronized manner. Again, the high-order
(n − a) bits select the same offset word from each module.
At the end of each memory cycle (Fig. 6), m = 2^a consecutive words are
latched in the data buffers simultaneously. The low-order a bits are then used
to multiplex the m words out, one per minor cycle.
If the minor cycle is chosen to be 1/m of the major memory cycle, then it
takes two memory cycles to access m consecutive words.
However, if the access phase of the last access is overlapped with the fetch
phase of the current access (Fig. 6b), effectively m words take only one
memory cycle to access. If the stride is greater than 1, the throughput
decreases, roughly in proportion to the stride.
Fig. 6 The S-access interleaved memory for vector operand access
Fig. 7 Multiway interleaved memory organization and the C-access timing
chart.

t = θ / m
where m is the degree of interleaving. The timing of the pipelined access of
eight contiguous memory words is shown in Fig. 7. This type of concurrent
access of contiguous words has been called a C-access memory scheme. The
major cycle θ is the total time required to complete the access of a single
word from a module. The minor cycle t is the actual time needed to produce
one word, assuming overlapped access of successive memory modules
separated by one minor cycle t.
Note that the pipelined access of the block of eight contiguous words is
sandwiched between other pipelined block accesses before and after the
present block. Even though the total block access time is 2θ, the effective
access time of each word is reduced to t as the memory is contiguously
accessed in a pipelined fashion.
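A sketch of this timing arithmetic, in our formulation: the first word of a
block costs a full major cycle θ, and each further word emerges one minor
cycle t = θ/m later:

# C-access timing: time to access a block of k contiguous words,
# m-way interleaved (our formulation of the text's 2-theta observation).

def block_access_time(theta, m, k):
    t = theta / m
    return theta + (k - 1) * t

theta, m = 8.0, 8        # e.g. theta = 8 time units, eight-way interleaving
total = block_access_time(theta, m, 8)
print(total, total / 8)  # 15.0 total (~2*theta); per-word time tends to t = 1
                         # as successive blocks pipeline back to back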
Memory Bandwidth
The memory bandwidth B of an m-way interleaved memory is upper-
bounded by m and lower-bounded by 1. The Hellerman estimate of B is

B ≈ m^0.56 ≈ √m

This pessimistic estimate is due to the fact that block access of various
lengths and access of single words are randomly mixed in user programs.
Hellerman's estimate was based on a single-processor system.
In a vector processing computer, the access time of a long vector with n
elements and stride distance 1 has been estimated by Cragon (1992) as
follows. It is assumed that the n elements are stored in contiguous memory
locations in an m-way interleaved memory system. The average time t_a
required to access one element in a vector is estimated by

t_a = (θ/m) × (1 + (m − 1)/n)

When n → ∞ (a very long vector), t_a → θ/m = t; when n = 1 (a scalar
access), t_a → θ.
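The two limits are easy to check numerically (θ and m below are illustrative
values of our choosing):

# Cragon's estimate of average per-element access time for a stride-1 vector
# in an m-way interleaved memory: t_a = (theta/m) * (1 + (m - 1)/n).

def cragon_access_time(theta, m, n):
    return (theta / m) * (1.0 + (m - 1) / n)

theta, m = 8.0, 8
print(cragon_access_time(theta, m, 1))       # 8.0  = theta   (scalar access)
print(cragon_access_time(theta, m, 10_000))  # ~1.0 = theta/m (long vector)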
Fault Tolerance
Fig. 8 Bandwidth analysis of two interleaved memory organizations over
eight memory modules (absolute address shown in each memory bank).
In the four-way interleaved design of Fig. 8a, the bandwidth of the gracefully
degraded memory system is reduced to four words per memory cycle
because only one of the two faulty banks is abandoned.
In the two-way design in Fig. 8b, the gracefully degraded memory system
may still have three working memory banks; thus a maximum bandwidth of
six words is expected. The higher the degree of interleaving, the higher the
potential memory bandwidth if the system is fault-free.
References
[1] Kai Hwang and Faye A. Briggs, Computer Architecture and Parallel
Processing, McGraw-Hill (1984).
[3] https://en.wikipedia.org/wiki/Cycles_per_instruction