Unit III
Parallel processing challenges – Flynn’s classification – SISD, MIMD, SIMD, SPMD, and Vector
Architectures - Hardware multithreading – Multi-core processors and other Shared Memory
Multiprocessors - Introduction to Graphics Processing Units, Clusters, Warehouse Scale Computers
and other Message-Passing Multiprocessors
INTRODUCTION
To fulfill increasing demands for higher performance, it is necessary to process data concurrently to achieve better
throughput instead of processing each instruction sequentially as in a conventional computer. Processing data
concurrently is known as parallel processing. There are two ways by which we can achieve parallelism. They are:
• Multiple Functional Units - The system may have two or more ALUs so that two or more instructions can be executed at the same time.
• Multiple Processors - The system may have two or more processors operating concurrently.
There are several different forms of parallel computing: bit-level, instruction-level, data-level and task-level
parallelism.
Multiprocessors
A computer system with two or more processors is called a multiprocessor system. The multiprocessor
software must be designed to work with a variable number of processors.
Features of Multiprocessor System:
o Better Performance
o Scalability
o Improved Availability / Reliability
o High Throughput
o Job-Level Parallelism/ Process-Level Parallelism
o Parallel Processing Program
Clusters
A set of computers connected over a local area network that function as a single large multiprocessor is called
a cluster.
Multicore Multiprocessors
A multicore is an architecture design that places multiple processors on a single die (computer chip) to enhance performance and allow multiple tasks to be processed simultaneously and more efficiently. Each of these processors is called a core.
Instruction Level Parallelism (ILP)
ILP is a measure of how many operations in a computer program can be performed simultaneously. The potential overlap among instructions is called instruction-level parallelism, and it is exploited by overlapping the execution of instructions to improve performance. Pipelining, which runs programs faster by overlapping the execution of instructions, is an example of instruction-level parallelism.
Two methods of increasing ILP
o Increasing the depth of the pipeline
By increasing the depth of the pipeline, more instructions can be overlapped simultaneously, so the amount of parallelism being exploited is higher.
o Multiple Issue
Multiple issue is a technique that replicates the internal components of the computer so that it can launch multiple instructions in every pipeline stage. Launching multiple instructions per stage allows the instruction execution rate to exceed the clock rate or, equivalently, the CPI to be less than 1. For example, a two-issue processor can complete up to two instructions per clock cycle, for a best-case CPI of 0.5.
Types of Multiple Issue
There are two major ways to implement a multiple-issue processor:
• Static Multiple Issue – An approach to implementing a multiple-issue processor in which many decisions are made statically by the compiler before execution.
• Dynamic Multiple Issue – An approach to implementing a multiple-issue processor in which many decisions are made during execution by the processor.
The Concept of Speculation
Speculation is an approach that allows the compiler or the processor to ‘guess’ about the properties of an
instruction, so as to enable execution to begin for other instructions that may depend on the speculated instruction.
Types of Speculation
1. Compiler based Speculation
The compiler can use speculation to reorder instructions, moving an instruction across a branch or a load
across a store. In compiler-based speculation, exception problems are avoided by adding special speculation support
that allows such exceptions to be ignored until it is clear that they really should occur.
Recovery Mechanism
In the case of speculation in software, the compiler usually inserts additional instructions that check
the accuracy of the speculation and provide a fix-up routine to use when the speculation is incorrect.
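The following C fragment is a rough, source-level sketch (not from the text, and not actual compiler output) of what compiler-based speculation does: a load is moved above the branch that guards it, and an inserted check fixes up the result when the guess is wrong. The names a, i, n, original, and speculated are invented for the illustration.

/* Original: the load happens only on the path where the guard is true. */
int original(int *a, int i, int n) {
    int x = 0;
    if (i < n)           /* branch */
        x = a[i];        /* load executed only when the branch is taken */
    return x;
}

/* Speculated: the compiler guesses the branch will be taken and hoists the
 * load above it; the check below fixes up the result if the guess was wrong.
 * If the hoisted load could raise an exception (e.g., an out-of-bounds
 * access), the special speculation support mentioned above is needed so the
 * exception is ignored unless the load really should have executed. */
int speculated(int *a, int i, int n) {
    int x = a[i];        /* speculative load, started before the branch resolves */
    if (!(i < n))        /* compiler-inserted check */
        x = 0;           /* fix-up: discard the speculated value */
    return x;
}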
2. Hardware-based Speculation
The processor hardware can perform the same transformation, i.e., reordering instructions, at runtime. In hardware-based speculation, exceptions are simply buffered until it is clear that the instruction causing them is no longer speculative and is ready to complete; at that point the exception is raised, and normal exception handling proceeds.
[Figure: the scheduled code as it would look on a two-issue MIPS pipeline; the empty slots are nops.]
Loop Unrolling
An important compiler technique to get more performance from loops that access arrays is loop unrolling, in which multiple copies of the loop body are made and instructions from different iterations are scheduled together. After unrolling, more ILP is available because instructions from different iterations can be overlapped.
Example
Loop: lw   $t0, 0($s1)         # $t0 = array element
      addu $t0, $t0, $s2       # add the scalar in $s2
      sw   $t0, 0($s1)         # store the result back
      addi $s1, $s1, -4        # decrement the pointer (4 bytes per word)
      bne  $s1, $zero, Loop    # branch if $s1 != 0
Let us see how well loop unrolling and scheduling work in the above example. For simplicity, assume that the loop index is a multiple of 4.
To schedule the loop without any delays, it turns out that we need to make four copies of the loop body. After unrolling and eliminating the unnecessary loop overhead instructions, the loop will contain four copies each of lw, addu, and sw, plus one addi and one bne.
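As a rough C-level sketch (not from the text), the loop above adds the scalar in $s2 to each array element; the unrolled version below makes four copies of the loop body so that one copy of the loop overhead covers four element updates. The function and variable names are invented, and, as in the example, n is assumed to be a multiple of 4.

/* C equivalent of the MIPS loop: add scalar s to every element, walking
 * backwards through the array (names invented for illustration). */
void add_scalar(int *a, int n, int s) {
    for (int i = n - 1; i >= 0; i--)
        a[i] = a[i] + s;
}

/* Four-way unrolled version: the index update and branch execute once per
 * four element updates, and the four updates are independent, exposing more
 * ILP for a multiple-issue pipeline to overlap. Assumes n % 4 == 0. */
void add_scalar_unrolled(int *a, int n, int s) {
    for (int i = n - 1; i >= 3; i -= 4) {
        a[i]     = a[i]     + s;
        a[i - 1] = a[i - 1] + s;
        a[i - 2] = a[i - 2] + s;
        a[i - 3] = a[i - 3] + s;
    }
}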
During the unrolling process, the compiler introduced additional registers ($t1, $t2, $t3). The goal of this
process, called register renaming, is to eliminate dependences that are not true data dependences, but could either lead
to potential hazards or prevent the compiler from flexibly scheduling the code.
Register Renaming
It is the process of renaming the registers by the compiler or hardware to remove antidependences.
Consider how the unrolled code would look using only $t0. There would be repeated instances of lw $t0, 0($s1), addu $t0, $t0, $s2 followed by sw $t0, 4($s1), but these sequences, despite using $t0, are actually completely independent – no data values flow between one set of these instructions and the next set. This is what is called an antidependence or name dependence, which is an ordering forced purely by the reuse of a name, rather than a real data dependence, which is also called a true dependence.
Reorder Buffer
A reorder buffer is the buffer that holds results in a dynamically scheduled processor until it is safe to store the results to memory or a register.
PARALLEL PROCESSING CHALLENGES
Amdahl's law gives us a quick way to find the speedup from two factors: Fraction enhanced (Fe) and Speedup enhanced (Se). It is given as

Execution time after improvement = (Execution time affected by improvement / Amount of improvement) + Execution time unaffected

Speedup = Execution time before / Execution time after improvement

Therefore, Speedup = 1 / ((1 - Fe) + Fe / Se)
Fraction enhanced (Fe)
It is the fraction of the computation time in the original machine that can be converted to take advantage of the enhancement.
Speedup enhanced (Se)
It tells how much faster the task would run if the enhancement mode was used for the entire program.
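The two-factor form of Amdahl's law is easy to check numerically. The small C helper below is only a sketch; the function name amdahl_speedup is invented for this illustration.

#include <stdio.h>

/* Amdahl's law: overall speedup given the enhanced fraction Fe and the
 * speedup Se of the enhanced part. */
double amdahl_speedup(double fe, double se) {
    return 1.0 / ((1.0 - fe) + fe / se);
}

int main(void) {
    /* Example: if 90% of the work is enhanced and that part runs 10x faster,
     * the overall speedup is only about 5.26x. */
    printf("Speedup = %.2f\n", amdahl_speedup(0.90, 10.0));
    return 0;
}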
Problems related to Amdahl’s Law:
1. Suppose you want to achieve a speed-up of 80 times faster with 100 processors. What percentage of the original computation can be sequential?
Solution:
Given: Speedup = 80, Speedup enhanced Se = 100, Fe = ?
Amdahl's law says that
Speedup = 1 / ((1 - Fe) + Fe / Se)
We can reformulate Amdahl's law in terms of speed-up versus the original execution time. This formula is usually rewritten assuming that the execution time before is 1 for some unit of time, and the execution time affected by the improvement is the fraction Fe of the original execution time:
80 = 1 / ((1 - Fe) + Fe / 100)
80 × ((1 - Fe) + Fe / 100) = 1
80 - 80Fe + 0.8Fe = 1
79.2Fe = 79
Fe ≈ 0.9975
Thus, to achieve a speedup of 80 from 100 processors, the sequential percentage can only be about 0.25%.
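Rearranging Amdahl's law for Fe gives Fe = (1 - 1/Speedup) / (1 - 1/Se). The short check below (a sketch with invented variable names) confirms the numbers for this problem.

#include <stdio.h>

int main(void) {
    double speedup = 80.0, se = 100.0;
    /* From Speedup = 1 / ((1 - Fe) + Fe/Se), solve for Fe. */
    double fe = (1.0 - 1.0 / speedup) / (1.0 - 1.0 / se);
    printf("Fe = %.4f, sequential fraction = %.4f\n", fe, 1.0 - fe);
    /* Prints Fe = 0.9975 and a sequential fraction of 0.0025 (about 0.25%). */
    return 0;
}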
2. Speed-up Challenge: Bigger Problem (Increase in Problem Size)
Suppose you want to perform two sums: one is a sum of 10 scalar variables, and one is a matrix sum of a pair of two-dimensional arrays, with dimensions 10 by 10. For now let us assume only the matrix sum is parallelizable. What speed-up do you get with 10 versus 40 processors? Next, calculate the speed-ups assuming the matrices grow to 20 by 20.
Solution:
(i) Matrices of size 10 by 10
If we assume performance is a function of the time for an addition, t, then there are 10 additions that do not benefit from parallel processors and 100 additions that do. The time for a single processor is 110t.

Execution time after improvement = (Execution time affected by improvement / Amount of improvement) + Execution time unaffected

Execution time for 10 processors = 100t / 10 + 10t = 20t
Speedup with 10 processors = 110t / 20t = 5.5
Execution time for 40 processors = 100t / 40 + 10t = 12.5t
Speedup with 40 processors = 110t / 12.5t = 8.8

Potential speedup with 10 processors = (5.5 / 10) × 100 = 55%
Potential speedup with 40 processors = (8.8 / 40) × 100 = 22%

Thus, for this problem size, we get only 55% of the potential speed-up with 10 processors and 22% with 40.

(ii) Matrices grow to 20 by 20
There are still 10 scalar additions, but now there are 400 matrix additions that can be done in parallel. The time for a single processor is 410t.

Execution time for 10 processors = 400t / 10 + 10t = 50t
Speedup with 10 processors = 410t / 50t = 8.2
Execution time for 40 processors = 400t / 40 + 10t = 20t
Speedup with 40 processors = 410t / 20t = 20.5

Potential speedup with 10 processors = (8.2 / 10) × 100 = 82%
Potential speedup with 40 processors = (20.5 / 40) × 100 = 51%

Thus, for this larger problem size, we get 82% of the potential speed-up with 10 processors and 51% with 40.
Conclusion:
These examples show that getting good speed-up on a multiprocessor while keeping the problem size fixed is harder than getting good speed-up by increasing the size of the problem.
This allows us to introduce two terms that describe ways to scale up.
1. Strong Scaling – Speedup achieved on a multiprocessor without increasing the size of the problem.
2. Weak Scaling – Speedup achieved on a multiprocessor while increasing the size of the problem
proportionally to the increase in the number of processors.
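The numbers from the bigger-problem example can be reproduced with the simple timing model used above (10 serial additions plus the parallelizable matrix additions, each costing time t). The sketch below is illustrative only; the helper name sum_speedup is invented.

#include <stdio.h>

/* Speedup for the sum example: 'serial' additions cannot be parallelized,
 * while 'parallel' additions are spread over p processors. Times are in
 * units of t. */
double sum_speedup(double serial, double parallel, double p) {
    double before = serial + parallel;      /* single-processor time */
    double after  = serial + parallel / p;  /* execution time after improvement */
    return before / after;
}

int main(void) {
    /* Strong scaling: problem size fixed at 10x10 (100 matrix additions). */
    printf("10x10, 10 procs: %.1f\n", sum_speedup(10, 100, 10));  /* 5.5  */
    printf("10x10, 40 procs: %.1f\n", sum_speedup(10, 100, 40));  /* 8.8  */
    /* Larger problem, 20x20 (400 matrix additions), as in weak scaling. */
    printf("20x20, 10 procs: %.1f\n", sum_speedup(10, 400, 10));  /* 8.2  */
    printf("20x20, 40 procs: %.1f\n", sum_speedup(10, 400, 40));  /* 20.5 */
    return 0;
}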
3. Speedup Challenge: Balancing Load
FLYNN’S CLASSIFICATION
Parallel processing can be classified in many ways. It can be classified according to the internal organization of
processors, according to the interconnection structure used between processors or according to the flow of information
through the system.
One such classification was introduced by Michael J. Flynn. We know that a typical processing unit operates by
fetching instructions and operands from the main memory, executing the instructions, and placing the results in the
main memory. The steps associated with the processing of an instruction form an instruction cycle. The instruction
can be viewed as forming an instruction stream flowing from main memory to the processor, while the operands form
another stream, data stream, flowing to and from the processor.
[Figure: the instruction stream flows from memory (M) to the processor (P), while the data stream flows between the processor and memory.]
In 1966, Michael J. Flynn made an informal and widely used classification of processor parallelism based on the number of simultaneous instruction and data streams seen by the processor during program execution.
The classification made by Michael J. Flynn divides computers into four major groups:
• Single Instruction Stream – Single Data Stream (SISD)
• Single Instruction Stream – Multiple Data Stream (SIMD)
• Multiple Instruction Stream – Single Data Stream (MISD)
• Multiple Instruction Stream – Multiple Data Stream (MIMD)
Categorization based on No. of instruction streams & No. of Data streams
This classification is based on the number of instruction streams and the number of data streams. Thus, a conventional uniprocessor has a single instruction stream and a single data stream, and a conventional multiprocessor has multiple instruction streams and multiple data streams.
An application is data parallel if it wants to do the same computation on lots of pieces of data, which typically
come from different squares in a grid. Examples include image processing, weather forecasting, and computational
fluid dynamics (e.g. simulating airflow around a car or inside a jet engine).
Providing more than one arithmetic logic unit (ALU) that can all operate in parallel on different inputs,
providing the same operation, is an example of SIMD. This can be achieved by using multiple input buses in the CPU
for each ALU that load data from multiple registers. The processor's control unit sends the same command to each of
the ALUs to process the data and the results may be stored, again using multiple output buses. Machines that provide
vector operations are classified as SIMD. In this case a single instruction is simultaneously applied to a vector.
Advantages of SIMD
• It amortizes the cost of the control unit over dozens of execution units.
• It reduces the required instruction bandwidth and program memory.
• It needs only one copy of the code that is being executed simultaneously.
• SIMD works best when dealing with arrays in ‘for’ loops. Hence, for parallelism to work in SIMD, there must be a great deal of identically structured data, which is called data-level parallelism.
Disadvantages of SIMD
• SIMD is at its weakest in case or switch statements, where each execution unit must perform a different operation on its data, depending on what data it has.
• Execution units with the wrong data are disabled so that units with proper data may continue. In such situations, the SIMD processor essentially runs at 1/nth of peak performance, where n is the number of cases (see the sketch below).
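A minimal C sketch (not from the text; array and function names invented) contrasts the two situations: the first loop applies the same operation to every element, which is the data-level parallelism SIMD exploits well, while the second contains a data-dependent branch of the kind that forces some execution units to be disabled.

/* SIMD-friendly: identical work on every element of the array, so a SIMD
 * unit or vectorizing compiler can process several elements per instruction. */
void scale(const float *a, float *b, int n, float k) {
    for (int i = 0; i < n; i++)
        b[i] = k * a[i];
}

/* SIMD-unfriendly: the operation depends on the data, so on each iteration
 * the lanes holding the 'wrong' case must be disabled, reducing efficiency. */
void clamp_or_scale(const float *a, float *b, int n, float k) {
    for (int i = 0; i < n; i++) {
        if (a[i] < 0.0f)
            b[i] = 0.0f;        /* one case */
        else
            b[i] = k * a[i];    /* a different case */
    }
}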
[Remnant of a comparison table; only rows 5 and 6 survive: row 5 - both approaches easily capture the flexibility in data widths; row 6 - one is easier to evolve over time, while the other is complex to evolve over time.]
HARDWARE MULTITHREADING
Multithreading
Multithreading is a higher-level parallelism called thread-level parallelism (TLP) because it is logically
structured as separate threads of execution.
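As a rough illustration of thread-level parallelism (not from the text), the POSIX threads sketch below creates two independent threads of execution; the function name worker and the argument values are invented. Each thread could run on its own core of a multicore processor, or be interleaved on one core by hardware multithreading.

#include <pthread.h>
#include <stdio.h>

/* Each thread runs this function independently, with its own program
 * counter and stack: a separate thread of execution. */
static void *worker(void *arg) {
    long id = (long)arg;
    printf("thread %ld running\n", id);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, (void *)1L);
    pthread_create(&t2, NULL, worker, (void *)2L);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}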
When pipelining is used, it is essential to maximize the utilization of each pipeline stage to improve throughput. This can be accomplished by executing some instructions in a different order rather than executing them sequentially as they occur in the instruction stream, and by initiating the execution of some instructions even though it is not yet certain that they will be needed.
Applications
Multi-core processors are widely used across many application domains including
• General-purpose
• Embedded
• Network
• Digital Signal Processing (DSP)
• Graphics
Although the latency is uniform, it may be large for a network that connects many processors and memory modules.
For better performance, it is desirable to place a memory module close to each processor. The result is a collection of
nodes, each consisting of a processor and a memory module.
INTRODUCTION TO GRAPHICS PROCESSING UNITS
The increasing demands of processing for computer graphics have led to the development of specialized chips called graphics processing units (GPUs). The primary purpose of GPUs is to accelerate the large number of floating-point calculations needed in high-resolution three-dimensional graphics, such as in video games. Since the operations involved in these calculations are often independent, a large GPU chip contains hundreds of simple cores with floating-point ALUs to perform them in parallel.
• An example is the Compute Unified Device Architecture (CUDA) that NVIDIA Corporation uses for the
cores in its GPU chips.
Key characteristics: how GPUs differ from CPUs
GPUs are accelerators that supplement a CPU, so they do not need to be able to perform all the tasks of a CPU. This role allows them to dedicate all their resources to graphics. It is fine for GPUs to perform some tasks poorly or not at all, given that in a system with both a CPU and a GPU, the CPU can do them if needed. GPU problem sizes are typically hundreds of megabytes to gigabytes, but not hundreds of gigabytes to terabytes.
These differences led to different styles of architecture:
• The biggest difference is that GPUs do not rely on multilevel caches to overcome the long latency to memory, as
do CPUs. Instead, GPUs rely on hardware multithreading to hide the latency to memory.
• The GPU memory is oriented toward bandwidth rather than latency. There are even special graphics DRAM chips for GPUs that are wider and have higher bandwidth than the DRAM chips used for CPUs. In addition, GPUs have traditionally had smaller main memories than conventional microprocessors.
• Given the reliance on many threads to deliver good memory bandwidth, GPUs can accommodate many parallel
processors (MIMD) as well as many threads. Hence, each GPU processor is more highly multithreaded than a
typical CPU, plus they have more processors.
An Introduction to the NVIDIA GPU Architecture
The figure shows a simplified block diagram of a multithreaded SIMD processor. Dropping down one more level of detail, the machine object that the hardware creates, manages, schedules, and executes is a thread of SIMD instructions, which we also call a SIMD thread. It is a traditional thread, but it contains exclusively SIMD instructions. These SIMD threads have their own program counters, and they run on a multithreaded SIMD processor.
GPU Memory is shared by the vectorized loops. All threads of SIMD instructions within a thread block share Local
Memory. The Figure shows the memory structures of an NVIDIA GPU.
• The on-chip memory that is local to each multithreaded SIMD processor is called Local Memory. It is shared by the SIMD lanes within a multithreaded SIMD processor, but this memory is not shared between multithreaded SIMD processors.
• The off-chip DRAM shared by the whole GPU and all thread blocks is called GPU Memory. Rather than relying on large caches to contain the whole working set of an application, GPUs traditionally use smaller streaming caches and rely on extensive multithreading of threads of SIMD instructions to hide the long latency to DRAM, since their working sets can be hundreds of megabytes.
CLUSTERS
Clusters are collections of computers that are connected to each other using their I/O interconnect via standard
network switches and cables to form a message-passing multiprocessor. Each runs a distinct copy of the operating
system. Virtually every internet service relies on clusters of commodity servers and switches.
Drawbacks of cluster
o Administration cost – The cost of administering a cluster of n machines is about the same as the cost of
administering n independent machines, while the cost of administering a shared memory multiprocessor
with n processors is about the same as administering a single machine.
o Performance degradation – The processors in a cluster are usually connected using the I/O interconnect
of each computer; whereas the cores in a multiprocessor are usually connected on the memory
interconnect of the computer. The memory interconnect has higher bandwidth and lower latency,
allowing much better communication performance.
o Division of memory – A cluster of n machines has n independent memories and n copies of the operating
system, but a shared memory multiprocessor allows a single program to use almost all the memory in the
computer, and it only needs a single copy of the OS.
Advantages of Clusters
1. High availability – Since a cluster consists of independent computers connected through a local area network,
it is much easier to replace a machine without bringing down the system in cluster than in an SMP.
2. Scalable – Given that clusters are constructed from whole computers and independent, scalable networks, this
isolation also makes it easier to expand the system without bringing down the application that runs on top of
the cluster.
3. Low cost
4. Improve power efficiency – Clusters consume less power and work efficiently.
Examples
The search engines that millions of us use every day depend upon this technology. eBay, Google,
Microsoft, Yahoo, and others all have multiple datacenters each with clusters of tens of thousands of processors.
Message passing
Message passing is nothing but communication between multiple processors by explicitly sending and
receiving information.
Send Message Routine
A routine used by a processor in machines with private memories to pass a message to another processor.
Receive Message Routine
A routine used by a processor in machines with private memories to accept a message from another processor.
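A minimal MPI sketch (illustrative only; the tag value and variable names are invented) of the send and receive message routines described above: process 0 sends one integer to process 1 by explicit message passing. It is meant to be launched with two processes, e.g. via mpirun -np 2.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Send message routine: pass one int to processor 1. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Receive message routine: accept the message from processor 0. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}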
Some concurrent applications run well on parallel hardware, independent of whether it offers shared addresses
or message passing. In particular, job-level parallelism and applications with little communication – like web search,
mail servers, and file servers – do not require shared addressing to run well.
Advantages
There were several attempts to build high-performance computers based on high-performance message-
passing networks, and they did offer better absolute communication performance than clusters built using local area
networks.
Disadvantages
The problem was that they were much more expensive. Few applications could justify the higher
communication performance, given the much higher costs.
WAREHOUSE-SCALE COMPUTERS
Warehouse-scale computers (WSCs) form the foundation of internet services that people use for search, social
networking, online maps, video sharing, online shopping, email, cloud computing, etc. The ever-increasing popularity
of internet services has necessitated the creation of WSCs in order to keep up with the growing demands of the public.