1 Introduction To Parallel Computing
What is Parallel Computing? (1)
Basic design
– Memory is used to store both program instructions and data
– Program instructions are coded data which tell the computer to do something
– Data is simply information to be used by the program
A central processing unit (CPU) fetches instructions and/or data from memory, decodes the instructions, and then performs them sequentially.
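As a rough, purely illustrative sketch of the fetch-decode-execute cycle described above (the two-field instruction format, the opcodes, and the tiny memory array are all invented for this example), a toy CPU in C might look like:

```c
#include <stdio.h>

/* Toy example only: a made-up two-field instruction format.
   Opcode 0 = HALT, opcode 1 = ADD operand to an accumulator. */
enum { HALT = 0, ADD = 1 };
struct instr { int opcode; int operand; };

int main(void) {
    /* "Memory" holds the program; the operands act as its data. */
    struct instr memory[] = { {ADD, 5}, {ADD, 7}, {HALT, 0} };
    int acc = 0;    /* accumulator register */
    int pc  = 0;    /* program counter      */

    for (;;) {
        struct instr i = memory[pc++];        /* fetch   */
        switch (i.opcode) {                   /* decode  */
        case ADD:  acc += i.operand; break;   /* execute */
        case HALT: printf("result = %d\n", acc); return 0;
        }
    }
}
```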
Like everything else, parallel computing has its own "jargon". Some of the
more commonly used terms associated with parallel computing are listed
below. Most of these will be discussed in more detail later.
Task
– A logically discrete section of computational work. A task is
typically a program or program-like set of instructions that is
executed by a processor.
Parallel Task
– A task that can be executed by multiple processors safely
(yields correct results)
Serial Execution
– Execution of a program sequentially, one statement at a time. In the
simplest sense, this is what happens on a one-processor machine.
However, virtually all parallel programs have sections that must be
executed serially. (A minimal sketch contrasting serial and parallel
sections follows this list.)
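To make the distinction concrete, here is a minimal C/OpenMP sketch (the array size and loop body are invented for illustration): the initialization loop is a serial section, while the second loop is a parallel task whose iterations are divided safely among threads.

```c
#include <stdio.h>

#define N 1000   /* illustrative problem size */

int main(void) {
    double a[N];

    /* Serial execution: one thread, one statement at a time. */
    for (int i = 0; i < N; i++)
        a[i] = 0.0;

    /* Parallel task: the loop body can be executed safely by many
       threads, each handling a different chunk of the iterations. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = i * 2.0;

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}
```

Compiled with an OpenMP-aware compiler (e.g. with -fopenmp), the second loop runs across multiple processors; the rest of the program still executes serially.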
Shared Memory
Distributed Memory
Hybrid Distributed-Shared Memory
Advantages
– Global address space provides a user-friendly programming
perspective to memory
– Data sharing between tasks is both fast and uniform due to the
proximity of memory to CPUs
Disadvantages:
– Primary disadvantage is the lack of scalability between memory
and CPUs. Adding more CPUs can geometrically increase traffic on
the shared memory-CPU path and, for cache coherent systems,
geometrically increase the traffic associated with cache/memory
management.
– The programmer is responsible for synchronization constructs that
ensure "correct" access of global memory (see the sketch after this
list).
– Expense: it becomes increasingly difficult and expensive to design
and produce shared memory machines with ever increasing
numbers of processors.
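As an illustration of that synchronization burden, the following C/OpenMP sketch (the shared counter sum and the loop bound are made up for the example) shows the kind of construct a programmer must add so that concurrent updates to shared global memory stay correct:

```c
#include <stdio.h>

int main(void) {
    long sum = 0;   /* lives in shared (global) memory, visible to all threads */

    #pragma omp parallel for
    for (long i = 0; i < 1000000; i++) {
        /* Without the atomic directive, concurrent updates to 'sum'
           race with one another and the final value is wrong.       */
        #pragma omp atomic
        sum += i;
    }

    printf("sum = %ld\n", sum);
    return 0;
}
```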
Advantages
– Memory is scalable with number of processors. Increase the
number of processors and the size of memory increases
proportionately.
– Each processor can rapidly access its own memory without
interference and without the overhead incurred with trying to
maintain cache coherency.
– Cost effectiveness: can use commodity, off-the-shelf processors
and networking.
Disadvantages
– The programmer is responsible for many of the details associated
with data communication between processors (a minimal MPI sketch
follows this list).
– It may be difficult to map existing data structures, based on global
memory, to this memory organization.
– Non-uniform memory access (NUMA) times
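As a minimal sketch of the explicit communication the programmer must manage in this model (the payload value and message tag are arbitrary, chosen only for illustration), two MPI tasks exchanging a single integer might look like:

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Task 0 owns the data in its local memory and must send it explicitly. */
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Task 1 must receive it explicitly; nothing is shared automatically
           between the two separate memories.                                  */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("task 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```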
Although it might not seem apparent, these models are NOT specific to
a particular type of machine or memory architecture. In fact, any of
these models can (theoretically) be implemented on any underlying
hardware.
Shared memory model on a distributed memory machine: Kendall
Square Research (KSR) ALLCACHE approach.
– Machine memory was physically distributed, but appeared to the user as a
single shared memory (global address space). Generically, this approach is
referred to as "virtual shared memory".
– Note: although KSR is no longer in business, there is no reason to suggest
that a similar implementation will not be made available by another vendor
in the future.
Message passing model on a shared memory machine: MPI on SGI Origin.
– The SGI Origin employed the CC-NUMA type of shared memory
architecture, where every task has direct access to global memory.
However, the ability to send and receive messages with MPI, as is
commonly done over a network of distributed memory machines, was
not only implemented but very commonly used.
OpenMP
– Compiler directive based; can use serial code
– Jointly defined and endorsed by a group of major computer
hardware and software vendors. The OpenMP Fortran API was
released October 28, 1997. The C/C++ API was released in late
1998.
– Portable / multi-platform, including Unix and Windows NT platforms
– Available in C/C++ and Fortran implementations
– Can be very easy and simple to use; provides for "incremental
parallelism" (see the sketch below)
Microsoft has its own implementation for threads, which is not
related to the UNIX POSIX standard or OpenMP.