Advanced Computer Architecture
Applications perform more and more sophisticated concurrent activities in the computer. The highest level of parallel processing is conducted among multiple jobs or programs through multiprogramming, time-sharing, and multiprocessing. This presentation covers the basics of parallel computing: beginning with a brief overview and some concepts and terminology associated with parallel computing, the topics of parallel memory architectures, parallel computer architectures and parallel programming models are then explored.
Introduction:-
Such techniques as carry-lookahead and carry-save addition are now built into almost all ALUs. High-speed multiplier recoding and convergence division are techniques for exploiting parallelism in arithmetic operations.
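To make this arithmetic-level parallelism concrete, here is a minimal C sketch of one carry-save step, which reduces three addends to a sum word and a carry word without propagating carries between bit positions; real ALUs realize this in combinational logic, so the code is purely illustrative.

    #include <stdio.h>
    #include <stdint.h>

    /* One carry-save step: reduce three addends to a sum word and a carry
       word with no carry propagation across bit positions. */
    static void carry_save(uint32_t a, uint32_t b, uint32_t c,
                           uint32_t *sum, uint32_t *carry)
    {
        *sum   = a ^ b ^ c;                           /* bitwise sum, carries ignored  */
        *carry = ((a & b) | (b & c) | (a & c)) << 1;  /* carries shifted one bit left  */
    }

    int main(void)
    {
        uint32_t s, cy;
        carry_save(9, 5, 3, &s, &cy);
        printf("%u\n", (unsigned)(s + cy));   /* one final carry-propagate add: prints 17 */
        return 0;
    }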
4.2.1.1 SISD Architecture
This is the oldest and, until recently, the most prevalent form of computer.
Examples: most PCs, single-CPU workstations and mainframes
Figure 4.1 SISD COMPUTER
4.2.1.2 SIMD Architecture
A type of parallel computer
Single instruction: All processing units execute the same instruction at any given clock
cycle
Multiple data: Each processing unit can operate on a different data element
This type of machine typically has an instruction dispatcher, a very high-bandwidth
internal network, and a very large array of very small-capacity instruction units.
Best suited for specialized problems characterized by a high degree of regularity, such as image processing (a short C sketch follows Figure 4.2).
Synchronous (lockstep) and deterministic execution
Two varieties: Processor Arrays and Vector Pipelines
Examples:
o Processor Arrays: Connection Machine CM-2, Maspar MP-1, MP-2
o Vector Pipelines: IBM 9000, Cray C90, Fujitsu VP, NEC SX-2, Hitachi S820
Figure 4.2 SIMD COMPUTER
CU - control unit
PU - processor unit
MM - memory module
SM - shared memory
IS - instruction stream
DS - data stream
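As noted above, a SIMD machine applies one instruction to many data elements in lockstep. The short C sketch below makes that concrete using x86 SSE intrinsics; it assumes a compiler and CPU with SSE support and is meant only to illustrate the model, not any particular machine listed in the examples above.

    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE intrinsics; assumes an x86 CPU with SSE */

    int main(void)
    {
        float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
        float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
        float c[4];

        __m128 va = _mm_loadu_ps(a);     /* load four data elements          */
        __m128 vb = _mm_loadu_ps(b);
        __m128 vc = _mm_add_ps(va, vb);  /* one instruction, four additions  */
        _mm_storeu_ps(c, vc);

        for (int i = 0; i < 4; i++)
            printf("%.1f ", c[i]);       /* prints: 11.0 22.0 33.0 44.0 */
        printf("\n");
        return 0;
    }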
4.2.1.3 MISD Architecture
There are n processor units, each receiving distinct instructions but operating on the same data stream and its derivatives. The output of one processor becomes the input of the next in the macropipeline. No real embodiment of this class exists.
A single data stream is fed into multiple processing units.
Each processing unit operates on the data independently via independent instruction
streams.
Few actual examples of this class of parallel computer have ever existed. One is the
experimental Carnegie-Mellon C.mmp computer (1971).
Some conceivable uses might be:
o multiple frequency filters operating on a single signal stream (a conceptual C sketch follows Figure 4.3)
o multiple cryptography algorithms attempting to crack a single coded message.
Figure 4.3 MISD COMPUTER
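Because no real MISD machine exists, the following C sketch is only a conceptual illustration of the first use listed above: two different instruction streams, written here as two hypothetical filter functions, operate on one and the same data stream. On ordinary hardware the two filters simply run one after the other.

    #include <stdio.h>

    #define N 8

    /* Two different "instruction streams" applied to the same data stream. */
    static void moving_average(const float *in, float *out, int n)
    {
        out[0] = in[0];
        for (int i = 1; i < n; i++)
            out[i] = 0.5f * (in[i] + in[i - 1]);
    }

    static void difference(const float *in, float *out, int n)
    {
        out[0] = in[0];
        for (int i = 1; i < n; i++)
            out[i] = in[i] - in[i - 1];
    }

    int main(void)
    {
        float signal[N] = {1, 2, 4, 8, 16, 8, 4, 2};   /* the single data stream */
        float smooth[N], edges[N];

        moving_average(signal, smooth, N);   /* "processing unit" 1 */
        difference(signal, edges, N);        /* "processing unit" 2 */

        for (int i = 0; i < N; i++)
            printf("%6.1f %6.1f\n", smooth[i], edges[i]);
        return 0;
    }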
4.2.1.4 MIMD Architecture
Multiple-instruction multiple-data streams (MIMD) parallel architectures are made of multiple
processors and multiple memory modules connected together via some interconnection network.
They fall into two broad categories: shared memory or message passing. Processors exchange
information through their central shared memory in shared memory systems, and exchange
information through their interconnection network in message passing systems.
Currently, the most common type of parallel computer. Most modern computers fall into
this category.
Multiple Instruction: every processor may be executing a different instruction stream
Multiple Data: every processor may be working with a different data stream
Execution can be synchronous or asynchronous, deterministic or non-deterministic
Examples: most current supercomputers, networked parallel computer "grids", and multiprocessor SMP computers, including some types of PCs.
A shared memory system typically accomplishes interprocessor coordination through a global
memory shared by all processors. These are typically server systems that communicate through a
bus and cache memory controller.
A message passing system (also referred to as distributed memory) typically combines the local
memory and processor at each node of the interconnection network. There is no global memory,
so it is necessary to move data from one local memory to another by means of message passing.
Figure 4.4 MIMD COMPUTER
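Because a message passing system has no global memory, data must be moved explicitly from one local memory to another, as described above. The minimal C sketch below illustrates that idea with MPI; it assumes an MPI implementation (e.g. MPICH or Open MPI) and at least two processes, and is not tied to any particular machine mentioned here.

    #include <stdio.h>
    #include <mpi.h>

    /* Run with, for example: mpirun -np 2 ./a.out */
    int main(int argc, char **argv)
    {
        int rank, value;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* which node am I? */

        if (rank == 0) {
            value = 42;                          /* data in node 0's local memory */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("node 1 received %d from node 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }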
Computer Class                              Computer System Models
1. SISD                                     IBM 701; IBM 1620; IBM 7090; PDP VAX11/780
2. SISD (with multiple functional units)    IBM 360/91; IBM 370/168 UP
3. SIMD (word-slice processing)             Illiac IV; PEPE
4. SIMD (bit-slice processing)              STARAN; MPP; DAP
5. MIMD (loosely coupled)                   IBM 370/168 MP; Univac 1100/80
6. MIMD (tightly coupled)                   Burroughs D825
Table 4.1 Flynn's Computer System Classification
4.2.2 Feng's Classification
Tse-yun Feng suggested the use of degree of parallelism to classify various computer
architectures.
Serial Versus Parallel Processing
The maximum number of binary digits that can be processed within a unit time by a
computer system is called the maximum parallelism degree P.
A bit slice is a string of bits, one from each of the words at the same vertical position.
There are four types of processing methods under the above classification (an illustrative tabulation follows the list):
Word Serial and Bit Serial (WSBS)
Word Parallel and Bit Serial (WPBS)
Word Serial and Bit Parallel (WSBP)
Word Parallel and Bit Parallel (WPBP)
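To make the four modes concrete: if n is the word length and m is the bit-slice length (the number of words whose bits are processed together), the maximum parallelism degree is P = n × m. The C tabulation below uses illustrative values for n and m (32-bit words and 16-word slices are assumptions, not figures from these notes).

    #include <stdio.h>

    /* Each mode characterized by word length n and bit-slice length m;
       the maximum parallelism degree is P = n * m. Values are illustrative. */
    struct mode { const char *name; int n; int m; };

    int main(void)
    {
        struct mode modes[] = {
            {"WSBS (serial processing)",       1,  1},   /* one bit of one word at a time */
            {"WPBS (bit-slice processing)",    1, 16},   /* one bit from each of 16 words */
            {"WSBP (word-slice processing)",  32,  1},   /* all 32 bits of one word       */
            {"WPBP (fully parallel)",         32, 16},   /* 16 whole 32-bit words at once */
        };

        for (int i = 0; i < 4; i++)
            printf("%-32s P = %d\n", modes[i].name, modes[i].n * modes[i].m);
        return 0;
    }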
In the early 1980s, the Caltech Concurrent Computation project built a supercomputer for scientific applications from 64 Intel 8086/8087 processors. This system showed that extreme performance could be achieved with mass-market, off-the-shelf microprocessors. These massively parallel processors (MPPs) came to dominate the top end of computing, with the ASCI Red supercomputer in 1997 breaking the barrier of one trillion floating-point operations per second. Since then, MPPs have continued to grow in size and power.
Starting in the late 80s, clusters came to compete with and eventually displace MPPs for many applications. A cluster is a type of parallel computer built from large numbers of off-the-shelf computers connected by an off-the-shelf network.
Today, clusters are the workhorse of scientific computing and are the dominant architecture in the data centers that
power the modern information age.
Today, parallel computing is becoming mainstream based on multi-core processors. Most desktop and laptop
systems now ship with dual-core microprocessors, with quad-core processors readily available. Chip manufacturers
have begun to increase overall processing performance by adding additional CPU cores. The reason is that
increasing performance through parallel processing can be far more energy-efficient than increasing microprocessor
clock frequencies. In a world which is increasingly mobile and energy conscious, this has become essential.
Fortunately, the continued transistor scaling predicted by Moore's Law will allow for a transition from a few cores to
many.
Parallel Software
The software world has been a very active part of the evolution of parallel computing. Parallel programs have been
harder to write than sequential ones. A program that is divided into multiple concurrent tasks is more difficult to write,
due to the necessary synchronization and communication that needs to take place between those tasks. Some
standards have emerged. For MPPs and clusters, a number of application programming interfaces converged to a
single standard called MPI by the mid 1990s. For shared memory multiprocessor computing, a similar process
unfolded with convergence around two standards by the mid to late 1990s: pthreads and OpenMP. In addition to
these, a multitude of competing parallel programming models and languages have emerged over the years. Some of
these models and languages may provide a better solution to the parallel programming problem than the above
standards, all of which are modifications to conventional, non-parallel languages like C.
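As a small illustration of how little a conventional C program changes under one of these shared-memory standards, the sketch below parallelizes a loop with a single OpenMP directive. It assumes a compiler with OpenMP support (e.g. gcc -fopenmp) and is a generic example, not code taken from any of the standards mentioned above.

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void)
    {
        static float x[N], y[N];
        float a = 2.0f;

        for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

        /* The only change from serial C: ask the runtime to split the
           loop iterations across the available threads. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            y[i] = a * x[i] + y[i];

        printf("y[0] = %.1f, threads available = %d\n", y[0], omp_get_max_threads());
        return 0;
    }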
As multi-core processors bring parallel computing to mainstream customers, the key challenge in computing today is
to transition the software industry to parallel programming. The long history of parallel software has not revealed any
silver bullets, and indicates that there will not likely be any single technology that will make parallel software
ubiquitous. Doing so will require broad collaborations
across industry and academia to create families of technologies that work together to bring the power of parallel
computing to future mainstream applications. The changes needed will affect the entire industry, from consumers to
hardware manufacturers and from the entire software development infrastructure to application developers who rely
upon it.
Future capabilities such as photorealistic graphics, computational perception, and machine learning rely heavily on highly parallel algorithms. Enabling these capabilities will advance a new generation of experiences that expand the scope and efficiency of what users can accomplish in their digital lifestyles and workplace. These experiences include more natural, immersive, and increasingly multisensory interactions that offer multi-dimensional richness and context awareness. The future for parallel computing is bright, but with new opportunities come new challenges.
In these four operating modes (batch processing, multiprogramming, time-sharing, and multiprocessing), the degree of parallelism increases sharply from phase to phase.
We define parallel processing as follows: parallel processing is an efficient form of information processing which emphasizes the exploitation of concurrent events in the computing process. Concurrency implies parallelism, simultaneity, and pipelining. Parallel processing demands concurrent execution of many programs in the computer. The highest level of parallel processing is conducted among multiple jobs or programs through multiprogramming, time sharing, and multiprocessing.
Parallel processing can be exploited at four programmatic levels:
Job or program level
Task or procedure level
Interinstruction level
Intrainstruction level
The highest (job) level is often conducted algorithmically. The lowest (intra-instruction) level is often implemented directly by hardware means. The role of hardware increases from the high levels to the low levels; conversely, the role of software increases from the low levels to the high levels.
(Figure: the spectrum of information processing, from data processing through information processing and knowledge processing to intelligence processing; complexity and sophistication in processing increase toward the top, while the volume of raw material to be processed increases toward the bottom.)
Figure 1.2 The system architecture of the supermini VAX 11/780 uniprocessor system
The trend is also supported by the increasing demand for faster, real-time, resource-sharing and fault-tolerant computing environments.
(The figure shows the VAX 11/780 organization: a CPU containing the ALU, registers R0..., the PC and a local memory; a floating-point accelerator; a console with diagnostic memory and a floppy disk; a main memory of 2^32 words of 32 bits each; and an input/output subsystem reached through the SBI, with I/O devices attached via the Unibus and the Massbus.)
It requires a broad knowledge of and experience with all aspects of algorithms, languages,
software, hardware, performance evaluation and computing alternatives.
Achieving parallel processing requires the development of more capable and cost-effective computer systems.
With respect to parallel processing, the general architectural trend has shifted from conventional uniprocessor systems to multiprocessor systems and to arrays of processing elements controlled by one uniprocessor. From the operating system point of view, computer systems have progressed through batch processing, multiprogramming, time-sharing and multiprocessing. Computers of the 1990s were expected to form the next generation, built from very-large-scale integrated chips with high-density modular designs; more than 1000 mega floating-point operations per second were expected of these future supercomputers. The evolution of computer systems helps in understanding the generations of computer systems.
4. Principles of pipelining
The two major parametric considerations in designing a parallel computer architecture are executing multiple instructions in parallel and increasing the efficiency of the processors. There are various methods by which instructions can be executed in parallel. Pipelining is one of the classical and effective methods to increase parallelism, where different stages perform repeated functions on different operands. Vector processing is arithmetic or logical computation applied on vectors, whereas in scalar processing only one data item or a pair of data items is processed.
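A quick calculation shows why pipelining increases throughput: a k-stage pipeline needs k cycles to fill and then completes one result per cycle, so n operands take about k + (n - 1) cycles instead of n × k. The C sketch below works this out for illustrative values of k and n (the numbers are assumptions, not taken from these notes).

    #include <stdio.h>

    int main(void)
    {
        int k = 4;      /* pipeline stages (illustrative)    */
        int n = 100;    /* number of operands (illustrative) */

        int serial    = n * k;        /* k cycles per operand, one after another     */
        int pipelined = k + (n - 1);  /* k cycles to fill, then one result per cycle */

        printf("serial: %d cycles, pipelined: %d cycles, speedup: %.2f\n",
               serial, pipelined, (double)serial / pipelined);
        return 0;
    }

For large n the speedup approaches k, the number of stages.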
5. Array Processing
Scalar Pipelines
Scalar to superscalar
The simplest processors are scalar processors. Each instruction executed by a scalar
processor typically manipulates one or two data items at a time. By contrast, each
instruction executed by a vector processor operates simultaneously on many data
items. An analogy is the difference between scalar and vector arithmetic. A
superscalar processor is a mixture of the two. Each instruction processes one data item, but there are multiple functional units within each CPU, so multiple instructions can be processing separate data items concurrently.
Superscalar CPU design emphasizes improving the instruction dispatcher accuracy,
and allowing it to keep the multiple functional units in use at all times. This has
become increasingly important as the number of units has increased. While early
superscalar CPUs would have two ALUs and a single FPU, a modern design such as
the PowerPC 970 includes four ALUs, two FPUs, and two SIMD units. If the dispatcher
is ineffective at keeping all of these units fed with instructions, the performance of
the system will be no better than that of a simpler, cheaper design.
A superscalar processor usually sustains an execution rate in excess of one
instruction per machine cycle. But merely processing multiple instructions
concurrently does not make an architecture superscalar, since pipelined,
multiprocessor or multi-core architectures also achieve that, but with different
methods.
In a superscalar CPU the dispatcher reads instructions from memory and decides
which ones can be run in parallel, dispatching each to one of the several functional
units contained inside a single CPU. Therefore, a superscalar processor can be
envisioned having multiple parallel pipelines, each of which is processing
instructions simultaneously from a single instruction thread.
Limitations
Available performance improvement from superscalar techniques is limited by three
key areas:
The degree of intrinsic parallelism in the instruction stream (instructions requiring
the same computational resources from the CPU).
The complexity and time cost of dependency checking logic and register renaming
circuitry
The branch instruction processing.
Existing binary executable programs have varying degrees of intrinsic parallelism. In
some cases instructions are not dependent on each other and can be executed
simultaneously. In other cases they are inter-dependent: one instruction impacts
either resources or results of the other. The instructions a = b + c; d = e + f can be
run in parallel because none of the results depend on other calculations. However,
the instructions a = b + c; b = e + f might not be runnable in parallel, depending on
the order in which the instructions complete while they move through the units.
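The C fragment below restates the example from this paragraph: the first pair of statements is independent and may be issued to two functional units in the same cycle, while in the second pair the write to b conflicts with the earlier read of b, a write-after-read hazard that constrains the order (and that register renaming can remove).

    #include <stdio.h>

    int main(void)
    {
        int b = 1, c = 2, e = 3, f = 4;
        int a, d;

        /* Independent: neither result feeds the other, so a superscalar
           dispatcher may issue both additions in the same cycle. */
        a = b + c;
        d = e + f;

        /* Dependent: the second statement writes b, which the first reads,
           so the hardware must preserve their ordering (or rename b). */
        a = b + c;
        b = e + f;

        printf("%d %d %d\n", a, d, b);
        return 0;
    }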
When the number of simultaneously issued instructions increases, the cost of dependency checking increases extremely rapidly. This is exacerbated by the need to check dependencies at run time and at the CPU's clock rate. This cost includes the additional logic gates required to implement the checks, and the time delays through those gates. Research shows that both the gate count and the delay grow much faster than linearly with the number of simultaneously dispatched instructions, and also depend on the number of instructions in the processor's instruction set.
Even though the instruction stream may contain no inter-instruction dependencies,
a superscalar CPU must nonetheless check for that possibility, since there is no
assurance otherwise and failure to detect a dependency would produce incorrect
results.
No matter how advanced the semiconductor process or how fast the switching
speed, this places a practical limit on how many instructions can be simultaneously
dispatched. While process advances will allow ever greater numbers of functional
units (e.g., ALUs), the burden of checking instruction dependencies grows rapidly, as
does the complexity of register renaming circuitry to mitigate some dependencies.
Collectively the power consumption, complexity and gate delay costs limit the
achievable superscalar speedup to roughly eight simultaneously dispatched
instructions.
However even given infinitely fast dependency checking logic on an otherwise
conventional superscalar CPU, if the instruction stream itself has many
dependencies, this would also limit the possible speedup. Thus the degree of
intrinsic parallelism in the code stream forms a second limitation.