Parallelism in Computer Architecture
Prepared by: Dr. J. Vinothkumar
Table of Contents

S.No  Contents
1     Aim & Objective
2     Prerequisite
4     Parallelism Theory
6     Reference
4. Parallelism
4.1 Introduction
Parallel Processing
Parallel processing can be described as a class of techniques that enables a system to perform simultaneous data-processing tasks in order to increase the computational speed of the computer system.
A parallel processing system can carry out simultaneous data-processing to achieve faster execution time.
For instance, while an instruction is being processed in the ALU component of the CPU, the next instruction can be read from memory.
The primary purpose of parallel processing is to enhance the computer's processing capability and increase its throughput.
A parallel processing system can be achieved by having a multiplicity of functional units that perform identical or different operations simultaneously.
The data can be distributed among the various multiple functional units.
The following diagram shows one possible way of separating the execution unit into eight functional units operating in parallel.
The operation performed in each functional unit is indicated in each block of the diagram:
The adder and integer multiplier perform arithmetic operations on integer numbers.
The floating-point operations are separated into three circuits operating in parallel.
The logic, shift, and increment operations can be performed concurrently on different data.
All units are independent of each other, so one number can be shifted while another number is being incremented.
Parallel computers can be roughly classified according to the level at which the hardware supports parallelism, with multi-core and multi-processor computers having multiple processing elements within a single machine.
In some cases parallelism is transparent to the programmer, such as in bit-level or instruction-level parallelism.
But explicitly parallel algorithms, particularly those that use concurrency, are more difficult to write than sequential ones, because concurrency introduces several new classes of potential software bugs, of which race conditions are the most common.
Communication and synchronization between the different subtasks are typically some of the greatest obstacles to getting optimal parallel program performance.
Types of Parallelism:
1. Bit-level parallelism: This form of parallel computing is based on increasing the processor's word size. It reduces the number of instructions that the system must execute in order to perform a task on large-sized data.
Example: Consider a scenario where an 8-bit processor must compute the sum of two 16-bit integers. It must first sum the 8 lower-order bits, then add the 8 higher-order bits, thus requiring two instructions to perform the operation. A 16-bit processor can perform the operation with just one instruction.
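As a rough illustration in C, the sketch below mimics what an 8-bit ALU must do to add two 16-bit integers: two 8-bit additions plus an explicit carry (the function name add16_via_8bit is illustrative, not from the text):

#include <stdint.h>
#include <stdio.h>

/* Sketch: adding two 16-bit integers using only 8-bit arithmetic,
   as an 8-bit processor would have to. */
uint16_t add16_via_8bit(uint16_t a, uint16_t b) {
    uint8_t lo = (uint8_t)a + (uint8_t)b;           /* step 1: low-order bytes   */
    uint8_t carry = lo < (uint8_t)a;                /* carry out of the low byte */
    uint8_t hi = (uint8_t)(a >> 8) + (uint8_t)(b >> 8) + carry;  /* step 2 */
    return ((uint16_t)hi << 8) | lo;
}

int main(void) {
    /* One 16-bit addition takes two 8-bit steps: 1000 + 2345 = 3345. */
    printf("%u\n", add16_via_8bit(1000, 2345));
    return 0;
}

A 16-bit ALU performs the same addition as a single operation, which is exactly the saving that bit-level parallelism provides.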
2. Instruction-level parallelism: Although a program is written as a sequential instruction stream, its instructions can be re-ordered and grouped by the processor so that they are executed concurrently without affecting the result of the program. This is called instruction-level parallelism.
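As a hypothetical illustration, consider the C fragment below: the first two statements are independent of each other, so the hardware (or compiler) may reorder them or execute them in the same cycle, while the last statement must wait for both:

/* Illustrative only: a and d have no data dependence on each other. */
int ilp_demo(int b, int c, int e, int f) {
    int a = b + c;   /* independent of the next statement     */
    int d = e - f;   /* may issue concurrently with a = b + c */
    return a * d;    /* true dependence: must wait for a and d */
}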
3. Task parallelism: Task parallelism employs the decomposition of a task into subtasks and then allocates each of the subtasks for execution. The processors perform execution of the subtasks concurrently.
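A minimal sketch of task parallelism in C with POSIX threads, assuming a POSIX system; the decomposition of an array sum into two half_sum subtasks, and all names, are illustrative:

#include <pthread.h>
#include <stdio.h>

#define N 8
static int data[N] = {1, 2, 3, 4, 5, 6, 7, 8};

struct range { int lo, hi, sum; };

/* One subtask: sum one half of the array. */
static void *half_sum(void *arg) {
    struct range *r = arg;
    r->sum = 0;
    for (int i = r->lo; i < r->hi; i++)
        r->sum += data[i];
    return NULL;
}

int main(void) {
    struct range left = {0, N / 2, 0}, right = {N / 2, N, 0};
    pthread_t t1, t2;
    pthread_create(&t1, NULL, half_sum, &left);   /* subtask 1 */
    pthread_create(&t2, NULL, half_sum, &right);  /* subtask 2, runs concurrently */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("total = %d\n", left.sum + right.sum); /* prints 36 */
    return 0;
}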
4. Data-level parallelism (DLP): Instructions from a single stream operate concurrently on several data elements. This form is limited by non-regular data manipulation patterns and by memory bandwidth.
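For example, the regular C loop below applies one operation to every element of an array; a vectorizing compiler can map this single instruction stream onto SIMD hardware so that several elements are processed per instruction (the conventional saxpy kernel, used here only as an illustration):

/* Data-level parallelism: the same operation on every element.
   A non-regular access pattern inside the loop would defeat this. */
void saxpy(int n, float a, const float *x, float *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}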
Architectural Trends
When multiple operations are executed in parallel, the number of cycles needed to execute the program is reduced.
However, resources are needed to support each of the concurrent activities.
Resources are also needed to allocate local storage.
The best performance is achieved by an intermediate action plan that uses resources to exploit both a degree of parallelism and a degree of locality.
Generally, the history of computer architecture has been divided into four generations based on the following basic technologies:
Vacuum tubes
Transistors
Integrated circuits
VLSI
Until 1985, developments were dominated by growth in bit-level parallelism:
4-bit microprocessors were followed by 8-bit, 16-bit, and so on.
To reduce the number of cycles needed to perform a full 32-bit operation, the width of the data path was doubled. Later on, 64-bit operations were introduced.
Growth in instruction-level parallelism dominated the mid-80s to mid-90s.
The RISC approach showed that it was simple to pipeline the steps of instruction processing so that, on average, an instruction is executed in almost every cycle.
Growth in compiler technology has made instruction pipelines more productive.
In the mid-80s, microprocessor-based computers consisted of:
An integer processing unit
A floating-point unit
A cache controller
SRAMs for the cache data
Tag storage
As chip capacity increased, all these components were merged into a single chip.
Thus, a single chip consisted of separate hardware for integer arithmetic, floating-point operations, memory operations, and branch operations.
Other than pipelining individual instructions, such a processor fetches multiple instructions at a time and sends them in parallel to different functional units whenever possible.
This type of instruction-level parallelism is called superscalar execution.
FLYNN'S CLASSIFICATION
Flynn's taxonomy is a classification of parallel computer architectures based on the number of concurrent instruction streams (single or multiple) and data streams (single or multiple) available in the architecture.
The four categories in Flynn's taxonomy are the following:
1. (SISD) single instruction, single data
2. (SIMD) single instruction, multiple data
3. (MISD) multiple instruction, single data
4. (MIMD) multiple instruction, multiple data
Instruction stream: the sequence of instructions as executed by the machine.
Data stream: a sequence of data, including input and partial or temporary results, called for by the instruction stream.
Instructions are decoded by the control unit, which then sends them to the processing units for execution.
The data stream flows between the processors and memory bidirectionally.
SISD
An SISD computing system is a uniprocessor machine which is capable of executing a single instruction operating on a single data stream.
SIMD
• An SIMD system is a multiprocessor machine capable of executing the same instruction on all the CPUs but operating on different data streams.
Machines based on the SIMD model are well suited to scientific computing, since it involves lots of vector and matrix operations.
So that the information can be passed to all the processing elements (PEs), the organized data elements of a vector can be divided into multiple sets (N sets for an N-PE system), and each PE can process one data set.
A dominant representative SIMD system is Cray's vector processing machine.
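As a concrete sketch of the "same instruction, multiple data" idea, the C fragment below uses x86 SSE intrinsics, where one _mm_add_ps instruction adds four floats at once. This assumes an SSE-capable x86 processor and is offered only as a modern analogue, not as how Cray machines were programmed:

#include <immintrin.h>   /* x86 SSE intrinsics */

/* One SIMD instruction performs four additions. */
void add4(const float *a, const float *b, float *out) {
    __m128 va = _mm_loadu_ps(a);             /* load 4 floats from a  */
    __m128 vb = _mm_loadu_ps(b);             /* load 4 floats from b  */
    _mm_storeu_ps(out, _mm_add_ps(va, vb));  /* 4 sums, 1 instruction */
}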
MISD
An MISD computing system is a multiprocessor machine capable of executing different instructions on different PEs, with all of them operating on the same data set.
The system performs different operations on the same data set. Machines built using the MISD model are not useful in most applications; a few machines have been built, but none of them are available commercially.
MIMD
An MIMD system is a multiprocessor machine which is capable of executing multiple instructions on multiple data sets.
Each PE in the MIMD model has separate instruction and data streams; therefore, machines built using this model are capable of handling any kind of application.
Unlike SIMD and MISD machines, PEs in MIMD machines work asynchronously.
MIMD machines are broadly categorized into
shared-memory MIMD and
distributed-memory MIMD
based on the way PEs are coupled to the main memory.
In the shared-memory MIMD model (tightly coupled multiprocessor systems), all the PEs are connected to a single global memory and they all have access to it. Communication between PEs in this model takes place through the shared memory; a modification of the data stored in the global memory by one PE is visible to all other PEs. Dominant representative shared-memory MIMD systems are Silicon Graphics machines and Sun/IBM's SMP (Symmetric Multi-Processing) machines.
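The visibility property described above can be sketched in C11: one thread (standing in for a PE) publishes a value in the shared global memory, and another spins until the modification becomes visible. This is an illustrative user-level analogue, not a description of any particular machine:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

atomic_int ready = 0;   /* lives in memory shared by all threads */
int payload = 0;

static void *producer(void *arg) {
    payload = 42;                                            /* modify shared data */
    atomic_store_explicit(&ready, 1, memory_order_release);  /* publish it         */
    return NULL;
}

static void *consumer(void *arg) {
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                                       /* wait until the write is visible */
    printf("saw payload = %d\n", payload);      /* prints 42 */
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}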
VECTOR ARCHITECTURES
A multithreaded CPU is not a parallel architecture, strictly speaking; multithreading is obtained through a single CPU, but it allows a programmer to design and develop applications as a set of programs that can virtually execute in parallel: namely, threads.
Multithreading is a solution to avoid wasting clock cycles while missing data is fetched: the CPU manages several peer threads concurrently, and if one thread gets blocked, the CPU can execute instructions of another thread, thus keeping the functional units busy.
Each thread must have a private program counter and a set of private registers, separate from other threads.
In a traditional scalar processor, the basic data type is an n-bit word.
The architecture often exposes a register file of words, and the instruction set is composed of instructions that operate on individual words.
In a vector architecture, there is support for a vector data type, where a vector is a collection of VL n-bit words (VL is the vector length).
There may also be a vector register file, which was a key innovation of the Cray architecture; previously, vector machines operated on vectors stored in main memory.
Figures 1 and 2 illustrate the difference between vector and scalar data types, and the operations that can be performed on them.
Vector load/store instructions provide the ability to do strided and scatter/gather memory accesses, which take data elements distributed throughout memory and pack them into sequential vectors/streams placed in vector/stream registers.
This promotes data locality.
It results in less data pollution, since only useful data is loaded from the memory system.
It provides latency tolerance because there can be many simultaneous outstanding memory accesses.
Vector instructions such as VLD and VST provide this capability.
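In software terms, a strided or gather load packs scattered memory elements into a dense vector register. The C sketch below is only a scalar emulation of what such hardware instructions do in a single operation; VL and the function names are illustrative:

enum { VL = 8 };   /* assumed vector length */

/* Strided load: every stride-th element, e.g. a column of a matrix. */
void strided_load(double vreg[VL], const double *mem, int stride) {
    for (int i = 0; i < VL; i++)
        vreg[i] = mem[i * stride];
}

/* Gather load: arbitrary indexed elements packed into a dense vector. */
void gather_load(double vreg[VL], const double *mem, const int idx[VL]) {
    for (int i = 0; i < VL; i++)
        vreg[i] = mem[idx[i]];
}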
HARDWARE MULTITHREADING
Multithreading
• A mechanism by which the instruction stream is divided into several smaller streams (threads) that can be executed in parallel is called multithreading.
Hardware Multithreading
• Increasing the utilization of a processor by switching to another thread when one thread is stalled is known as hardware multithreading.
Thread
• A thread includes the program counter, the register state, and the stack. It is a lightweight process; whereas threads commonly share a single address space, processes don't.
Thread Switch
• The act of switching processor control from one thread to another within the same process. It is much less costly than a process switch.
Process
• A process includes one or more threads, the address space, and the operating system state. Hence, a process switch usually invokes the operating system, but a thread switch does not.
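The address-space distinction can be demonstrated on a POSIX system: a thread's write to a global variable is visible to the thread that created it, while a forked process modifies only its own copy. A minimal illustrative C sketch:

#include <pthread.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int shared = 0;   /* one copy per address space */

static void *bump(void *arg) { shared++; return NULL; }

int main(void) {
    pthread_t t;                       /* threads share the address space */
    pthread_create(&t, NULL, bump, NULL);
    pthread_join(t, NULL);
    printf("after thread:  shared = %d\n", shared);   /* 1 */

    if (fork() == 0) {                 /* a process gets a private copy */
        shared++;
        _exit(0);
    }
    wait(NULL);
    printf("after process: shared = %d\n", shared);   /* still 1 */
    return 0;
}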
Types of Multi-threading
1. Fine-grained Multithreading
2. Coarse-grained Multithreading
3. Simultaneous Multithreading
Coarse-grained Multithreading
A version of hardware multithreading that implies switching between threads only after significant events, such as a last-level cache miss.
• This choice relieves the need to have thread switching be extremely fast and is much less likely to slow down the execution of an individual thread, since instructions from other threads will only be issued when a thread encounters a costly stall.
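A toy software model of this switch-on-costly-stall policy (purely illustrative; a real core implements the policy in hardware, and the miss pattern below is invented):

#include <stdio.h>

enum { NTHREADS = 2, CYCLES = 8 };

int main(void) {
    int llc_miss[NTHREADS][CYCLES] = {  /* 1 = last-level cache miss  */
        {0, 0, 1, 0, 0, 0, 0, 0},       /* thread 0 misses in cycle 2 */
        {0, 0, 0, 0, 0, 1, 0, 0},       /* thread 1 misses in cycle 5 */
    };
    int current = 0;
    for (int c = 0; c < CYCLES; c++) {
        if (llc_miss[current][c])                /* significant event...   */
            current = (current + 1) % NTHREADS;  /* ...then switch threads */
        printf("cycle %d: issue from thread %d\n", c, current);
    }
    return 0;
}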
Advantage
• Thread switching does not need to be extremely fast.
• It does not slow down the execution of an individual thread.
Disadvantage
• It is hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs.
• Since the CPU issues instructions from one thread, when a stall occurs the pipeline must be emptied.
• The new thread must fill the pipeline before instructions can complete.
• Due to this start-up overhead, coarse-grained multithreading is much more useful for reducing the penalty of high-cost stalls, where the pipeline refill time is negligible compared to the stall time.
Fine-grained Multithreading
• A version of hardware multithreading that implies switching between threads after every instruction, resulting in interleaved execution of multiple threads. It switches from one thread to another at each clock cycle.
• This interleaving is often done in a round-robin fashion, skipping any threads that are stalled at that clock cycle.
To make fine-grained multithreading practical, the processor must be able to switch threads on every clock cycle.
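A toy model of this per-cycle round-robin selection (illustrative only; the stall pattern is invented):

#include <stdio.h>

enum { NTHREADS = 3, CYCLES = 6 };

int main(void) {
    int stalled[NTHREADS][CYCLES] = {   /* 1 = thread stalled that cycle */
        {0, 1, 0, 0, 1, 0},
        {0, 0, 0, 1, 0, 0},
        {1, 0, 0, 0, 0, 0},
    };
    int next = 0;
    for (int c = 0; c < CYCLES; c++) {
        int issued = 0;
        /* pick the next ready thread, skipping stalled ones */
        for (int t = 0; t < NTHREADS && !issued; t++) {
            int cand = (next + t) % NTHREADS;
            if (!stalled[cand][c]) {
                printf("cycle %d: issue from thread %d\n", c, cand);
                next = (cand + 1) % NTHREADS;
                issued = 1;
            }
        }
        if (!issued)
            printf("cycle %d: all threads stalled\n", c);
    }
    return 0;
}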
Advantage
• Vertical waste is eliminated.
• Pipeline hazards cannot arise.
• Zero switching overhead.
• Ability to hide latency within a thread, i.e., it can hide the throughput losses that arise from both short and long stalls.
• Instructions from other threads can be executed when one thread stalls.
• High execution efficiency.
• Potentially less complex than alternative high-performance processors.
Disadvantage
• Clock cycles are wasted if a thread has little work to execute.
• It needs a lot of threads to be effective.
• It is more expensive than coarse-grained multithreading.
• It slows down the execution of the individual threads, since a thread that is ready to execute without stalls will be delayed by instructions from other threads.
Simultaneous multithreading (SMT)
• It is a variation on hardware multithreading that uses the resources of a multiple-issue, dynamically scheduled pipelined processor to exploit thread-level parallelism at the same time it exploits instruction-level parallelism.
• The key insight that motivates SMT is that multiple-issue processors often have more functional-unit parallelism available than most single threads can effectively use.
Since SMT relies on the existing dynamic mechanisms, it does not switch resources every cycle.
• Instead, SMT is always executing instructions from multiple threads, leaving it to the hardware to associate instruction slots and renamed registers with their proper threads.
Advantage
• It has the ability to boost utilization by dynamically scheduling functional units among multiple threads.
• It increases hardware design flexibility.
• It produces better performance and adds resources in a fine-grained manner.
Disadvantage
It cannot improve performance if any of the shared resources are the limiting bottlenecks for the performance.
Uniform Memory Access (UMA) multiprocessors
• Physical memory is uniformly shared by all processors, with equal access time to all words.
• Processors may have local cache memories. Peripherals are also shared in some fashion.
• UMA architecture models are of two types:
Symmetric:
• All processors have equal access to all peripheral devices. All processors are identical.
Asymmetric:
• One processor (the master) executes the operating system; other processors may be of different types and may be dedicated to special tasks.
Non Uniform Memory Access (NUMA) multiprocessors
• In shared-memory multiprocessor systems, local memories can be connected with every processor. The collection of all local memories forms the global memory being shared.
• In this way, global memory is distributed to all the processors. In this case, access to a local memory is uniform for its corresponding processor, as it is attached to that local memory.
• But if a reference is to the local memory of some other remote processor, then the access is not uniform.
• It depends on the location of the memory. Thus, all memory words are not accessed uniformly. All local memories form a global address space accessible by all processors.
• Programming NUMAs is harder, but NUMAs can scale to larger sizes and have lower latency to local memory.
• Memory is common to all the processors. Processors easily communicate by means of shared variables.
• These systems differ in how the memory and peripheral resources are shared or distributed.
• The access time varies with the location of the memory word.
CLUSTER SYSTEM
Clustered systems are similar to parallel systems as they both have multiple CPUs.
However, a major difference is that clustered systems are created by two or more individual computer systems merged together.
Basically, they are independent computer systems with a common storage, and the systems work together.
Each node in a clustered system contains the cluster software. This software monitors the cluster system and makes sure it is working as required. If any one of the nodes in the clustered system fails, the rest of the nodes take control of its storage and resources and try to restart it.
• In this case, all local memories are private and are accessible only to the local processors.
• This is why such machines are called no-remote-memory-access (NORMA) machines.