Migration to  Multi-Core Zvi Avraham, CTO [email_address]
Agenda Multi-Core and Many-Core Hardware slides (another PowerPoint) Parallel Programming Models CPU Affinity Parallel Programming APIs Win32 Threads API OpenMP Tutorial (another PowerPoint) Intel TBB – Thread Building Blocks (if time permits) Demos – C++ source code samples Summary Questions
Multi-Core vs. Many-Core Multi-Core:  ≤8 cores/threads Many-Core: >8 cores/threads
Multi-Core x86 AMD Athlon X2 AMD Opteron Dual-Core AMD Barcelona Quad-Core AMD Phenom Triple-Core Pentium D Intel Core 2 Duo Intel Core 2 Quad Dual Core Xeon Quad Core Xeon
Many-Core Processors (MPU) the replacement for DSP and FPGA Sun UltraSPARC T2 / Niagara – 8 cores x 8 threads (the fastest commodity processor today) Sun UltraSPARC RK / Rock – 16 cores x 2 threads RMI XLR™ 732 Processor – 8 cores x 4 threads (MIPS) Cavium OCTEON – 16 cores (MIPS) TILERA TILE64 – 64 cores (MIPS)
Intel Tera-scale 80 cores ~ 1.81 Teraflops, 2 GB RAM on chip Currently only a prototype (expected in 2009/2010)
Cell Processor SONY Playstation 3 1 main PowerPC core x 2 threads @ 3 GHz + 7 "Synergistic Processing Elements" @ 3 GHz Total – 9 threads
Xenon CPU Microsoft XBOX 360 PowerPC – 3 Cores x 2 Threads @ 3.2 GHz
NVIDIA GeForce 8800GTX 128 Stream Processors @ 1.3 GHz, ~520 GFLOPS
Intel Larrabee GPU Up to 48 cores x 4 threads Cores based on Pentium (x86-64) 512 bit SIMD
Hardware Slides See external PowerPoint presentation: Maximizing Desktop Application Performance on Dual-Core PC Platforms
Parallel Programming Models
Concurrent/Parallel/Distributed Concurrency – property of systems in which several computational processes are executing at the same time, and potentially interacting with each other Parallel – computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently ("in parallel") Distributed – the tasks run on different interconnected computers
Levels of HW Parallelism Bit-level (e.g. from 8-bit to 16-bit to 32-bit to 64-bit CPUs) Instruction-level (the CPU schedules multiple instructions simultaneously) SIMD SMT – Simultaneous Multi-Threading CMP – Core Multi-Processing SMP – Symmetric Multi-Processing Cluster – computers with fast interconnect Grid – network of loosely connected computers
Flynn's Taxonomy

                  Single Instruction    Multiple Instruction
Single Data       SISD                  MISD
Multiple Data     SIMD                  MIMD
Flynn’s Taxonomy (2)
Parallel Programming Models How can we write programs that  run faster  on a multi-core CPU?  How can we write programs that  do not crash  on a multi-core CPU?  Choose the right model! There are two fundamental parallel programming models: Shared State  (Shared Memory) Message Passing  (Distributed Memory) DSM  (Distributed Shared Memory) – academic
Shared State Shared state concurrency involves the idea of “mutable state” (literally memory that can be changed). This is fine as long as you have only one process/thread doing the changing. If you have multiple processes sharing and modifying the same memory, you have a recipe for disaster -  madness lies here. To protect against the simultaneous modification of shared memory, we use a locking mechanism. Call this a mutex, critical section, synchronized method, or whatever you like, but it’s still a lock.
Message Passing In message passing concurrency, there is no shared state. All computations are done inside processes/threads, and the only way to exchange data is through asynchronous message passing. Why is this good? No need for locks. No locks – no deadlocks (almost) No locks – deterministic execution – good for Real Time. No locks – good for I/O-bounded applications – Efficient Network Programming.
Message Passing (cont.) ActiveObject design pattern ActiveObject is an Object with its own thread of control and an attached message queue. On arrival of a message, the ActiveObject wakes up and executes the command in the message. Optionally, it sends the result back to the caller (via callback or Future). Rational Rose Capsule design pattern The same as ActiveObject, but with an attached FSM. Messages are called "Events". Events change the FSM state according to the state transition table. MPI (Message Passing Interface) - de-facto standard API for Message Passing between nodes in computational clusters.
Active Object Design Pattern
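The original slide shows the ActiveObject structure as a diagram; the sketch below is a minimal C++11 illustration of the same idea (the class and method names here are illustrative, not taken from the slides):

// Minimal ActiveObject sketch: one private worker thread drains a message queue.
// Callers never touch the object's state directly; they only post messages.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

class ActiveObject {
public:
    ActiveObject() : done_(false), worker_([this] { run(); }) {}
    ~ActiveObject() {
        post([this] { done_ = true; });   // "poison" message stops the worker
        worker_.join();                   // messages posted after it are discarded
    }
    // The only way in: enqueue a command for the private thread.
    void post(std::function<void()> msg) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(msg));
        }
        cv_.notify_one();
    }
private:
    void run() {
        while (!done_) {
            std::function<void()> msg;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return !queue_.empty(); });
                msg = std::move(queue_.front());
                queue_.pop();
            }
            msg();   // execute the command; results go back via callback or future
        }
    }
    bool done_;
    std::queue<std::function<void()>> queue_;
    std::mutex mutex_;
    std::condition_variable cv_;
    std::thread worker_;
};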
Rational’s Capsule ROOM – Real-Time Object Oriented Modeling
Implicit vs. Explicit Parallelism Implicit Parallelism Automatic parallelization of code by a compiler or library A pure implicitly parallel language does not need special directives, operators or functions to enable parallel execution. The programmer stays focused on the core task, instead of worrying about dividing work into tasks and communication Lower degree of control for the programmer Less effective Examples: Matlab, R, LabView, NESL, ZPL, Intel Ct library for C++
Implicit vs. Explicit Parallelism Explicit Parallelism Representation of concurrent computations by means of primitives in the form of special-purpose directives or function calls (this extra code is also called "parallelization overhead") Most parallel primitives are related to process synchronization, communication or task partitioning Full control by the programmer Examples: thread APIs, Erlang, Ada, Cilk, etc.
Data Parallel vs. Task Parallel Data Parallelism a form of parallelization of computing across multiple processors in parallel computing environments. Data parallelism focuses on distributing the data across different parallel computing nodes.  Task Parallelism a form of parallelization of computer code across multiple processors in parallel computing environments. Task parallelism focuses on distributing execution processes (threads) across different parallel computing nodes. Most real programs fall somewhere on a continuum between Task parallelism and Data parallelism.
Data Parallel example

if CPU = "a" then
    low_limit = 1
    upper_limit = 50
else if CPU = "b" then
    low_limit = 51
    upper_limit = 100
end if
for i = low_limit to upper_limit
    Task on d(i)
end for

Task Parallel example

program:
    ...
    if CPU = "a" then
        do task "A"
    else if CPU = "b" then
        do task "B"
    end if
    ...
end program
"Plain" vs Nested Data Parallelism "Plain" Data Parallelism Good for regular data (vectors, matrices) No support for control flow Efficient impl. on SIMD or GPGPU HW Nested Data Parallelism Supports irregular or sparse data Can express control flow Efficient impl. on multi-core CPUs with SIMD Examples: [a:5, b:7] + [b:3, c:3, d:1] = [a:5, b:10, c:3, d:1] sum( [a:5, a:3, b:2, b:4, d:1] ) = [a:8, b:6, d:1]
Parallel Models Data Parallel models: Work Sharing (using Fork / Join) Parallel For Scatter / Gather Map / Reduce Split / Map / Combine / Reduce / Merge Task Parallel models: Fork / Join Recursive Fork / Join (Work Stealing scheduler) Scheduler / Workers (aka Master / Slave) Compute Grid:  Executer / Scheduler / Workers
Scatter / Gather
Fork / Join (diagram: a Master Thread forks Parallel Regions; barriers act as "joins")

Recursive Fork/Join

Result solve(Problem problem) {
    if (problem is small)
        directly solve problem
    else {
        split problem into independent parts
        fork new subtasks to solve each part
        join all subtasks
        compose result from subresults
    }
}
Map / Reduce Input & Output: each a set of key/value pairs  Programmer specifies two functions: map (in_key, in_value) ->  list(out_key, intermediate_value)  Processes input key/value pair  Produces set of intermediate pairs  reduce (out_key, list(intermediate_value)) ->  list(out_value)  Combines all intermediate values for a particular key  Produces a set of merged output values (usually just one)
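To make the map/reduce contract above concrete, here is a minimal single-process word-count sketch in C++ (illustrative only; a real MapReduce runtime distributes the map, shuffle and reduce steps across many nodes):

// Word count with the two user-supplied functions described above: map and reduce.
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// map(in_key, in_value) -> list of (out_key, intermediate_value)
std::vector<std::pair<std::string, int>> map_fn(const std::string& line) {
    std::vector<std::pair<std::string, int>> out;
    std::istringstream words(line);
    std::string w;
    while (words >> w) out.emplace_back(w, 1);   // emit (word, 1)
    return out;
}

// reduce(out_key, list(intermediate_value)) -> out_value
int reduce_fn(const std::string& /*key*/, const std::vector<int>& values) {
    int sum = 0;
    for (int v : values) sum += v;
    return sum;
}

int main() {
    std::vector<std::string> input = { "to be or not to be", "to thine own self be true" };
    // Shuffle phase: group intermediate values by key.
    std::map<std::string, std::vector<int>> groups;
    for (const auto& line : input)
        for (const auto& kv : map_fn(line))
            groups[kv.first].push_back(kv.second);
    // Reduce phase: one call per distinct key.
    for (const auto& g : groups)
        std::cout << g.first << ": " << reduce_fn(g.first, g.second) << "\n";
}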
Map / Reduce
Types of Parallelism Fine-grained – tasks need to communicate many times per second; small messages, low latency Coarse-grained – tasks do not need to communicate many times per second Embarrassing parallelism – tasks are independent
Embarrassingly Parallel Problems An Embarrassingly Parallel Problem is one for which no particular effort is needed to segment the problem into a very large number of parallel tasks, and there is no essential dependency (or communication) between those parallel tasks. If each step can be computed independently of every other step, then each step can be made to run on a separate processor to achieve quicker results. Examples: running the same algorithm on each grabbed frame running the same algorithm with different parameters
Extended Flynn’s Taxonomy SPMD  -  Single Program, Multiple Data streams   Multiple autonomous processors simultaneously executing the same program (but at independent points, rather than in the  lockstep  that SIMD imposes) on different data. Also referred to as 'Single Process, multiple data' [6] . SPMD is the most common style of parallel programming [7] .  MPMD  -  Multiple Program, Multiple Data streams Multiple autonomous processors simultaneously operating at least 2 independent programs. Typically such systems pick one node to be the "host" ("the explicit host/node programming model") or "manager" (the "Manager/Worker" strategy), which runs one program that farms out data to all the other nodes which all run a second program. Those other nodes then return their results directly to the manager.
CPU Affinity Changing Process and Thread Affinity
CPU Affinity The system uses a symmetric multiprocessing (SMP) model to schedule processes and threads on multiple processors. Any process/thread can be assigned to any processor. On a single-CPU system threads can't run concurrently; they are time-sliced instead. On a multiple-CPU system threads can run concurrently on different processors. Scheduling is still determined by thread priority. However, on a multiprocessor computer, you can also affect scheduling by setting thread affinity and the thread ideal processor (i.e. "binding" a thread to a specific CPU).
Why mess with CPU Affinity? Legacy Code Migration Usually the Windows OS handles CPU affinity automatically very well. So why mess with it? When migrating old applications to a multi-core CPU, their performance may start degrading. This happens because the OS moves the application's threads and interrupt routines between cores. By fixing the application's process and threads to CPU #0, the application will behave the same way as on a uniprocessor machine.
Why mess with CPU Affinity? Real-Time and Determinism By setting CPU affinity for our application's threads we get deterministic execution times, needed for Real-Time. The OS will no longer move our threads between cores. For example, let's say we are developing a video surveillance system on a dual-core CPU. We have two threads: #1 – receiving video frames from the Matrox Frame Grabber #2 – transmitting the received frames via TCP/IP to the Server By setting the affinity of the receiving thread to CPU #0 and of the transmitting thread to CPU #1 – we get deterministic execution times.
Why mess with CPU Affinity? Memory Affinity Cache locality Some multi-core CPUs have separate per-core caches, with no cache shared between cores (for example: Pentium-D, some of Intel's quad-core CPUs). If the OS moves a thread from one core to another, its data is missing from the second core's cache and must be fetched from RAM again, so we get performance degradation. NUMA On NUMA systems, each memory bank connects directly to a single CPU; accessing RAM attached to another CPU is slower. By setting a thread's CPU affinity and allocating memory on the faster RAM bank, we get a performance improvement.
Changing CPU Affinity Using Task Manager – manually; needs to be changed every time the program is run ImageCfg.exe Utility – modifies the program's EXE image – need to do it only once Process.exe Utility – changes the program's affinity without modifying its EXE image IntFiltr.exe – Interrupt Affinity Filter tool Win32 Affinity API: SetProcessAffinityMask SetThreadAffinityMask
CPU Affinity – Task Manager
CPU Affinity – Task Manager
ImageCFG – Affinity Mask Tool Change the CPU Affinity mask of an existing EXE file WARNING: modifies the .EXE file – so back it up before you do anything "-u" - Marks the image as uniprocessor only: imagecfg -u c:\path\to\file.exe "-a" – Process Affinity mask value in hex: Permanently set the CPU Affinity mask to CPU #1 imagecfg -a 0x1 c:\path\to\file.exe Permanently set the CPU Affinity mask to CPU #2 imagecfg -a 0x2 c:\path\to\file.exe
Process.exe – Get Affinity Mask When the "-a" option is used in conjunction with a process name or PID, the utility will show the System Affinity Mask and the Process Affinity Mask.
Process.exe – Set Affinity Mask To set the affinity mask, simply append the binary mask after the PID/Image Name. Any leading zeros are ignored, so there is no requirement to enter the full 32-bit mask. Doesn't modify the EXE file – so it must be run each time
IntFiltr.exe – Interrupt Affinity Filter Binding device interrupts to particular processors on multiprocessor computers is a useful technique to maximize performance, scaling, and partitioning of large computers. The Interrupt-Affinity Filter (IntFiltr) is an interrupt-binding tool that permits you to establish affinity between device interrupts and processors on multiprocessor computers. IntFiltr uses Plug and Play features of Windows 2000 and provides a Graphical User Interface (GUI) to permit interrupt binding.
Parallel Programming APIs
Parallel Programming APIs Threads – native OS threads API (Win32 Threads, POSIX pthreads, etc.). Low-level, highly error-prone API. Locking, etc. OpenMP – SMP standard, which allows incremental parallelization of serial code by adding compiler directives, called "pragmas". Intel TBB – Thread Building Blocks C++ library from Intel. High-level constructs: think in terms of tasks instead of threads (like a "Parallel STL"). Auto-Parallelization
Simplicity / Complexity The native threads programming model introduces much more complexity within the code than OpenMP or Intel® TBB, making it more challenging to maintain. One of the benefits of using Intel® TBB or OpenMP when appropriate is that these APIs create and manage the thread pool for you:  thread synchronization and scheduling are handled automatically .
Capabilities Comparison

                                       Intel TBB   OpenMP   Threads
Task level parallelism                     +          +        -
Data decomposition support                 +          +        -
Complex parallel patterns (non-loops)      +          -        -
Generic parallel patterns                  +          -        -
Scalable nested parallelism support        +          -        -
Built-in load balancing                    +          +        -
Affinity support                           -          +        +
Static scheduling                          -          +        -
Concurrent data structures                 +          -        -
Scalable memory allocator                  +          -        -
I/O dominated tasks                        -          -        +
User-level synch. primitives               +          +        -
Compiler support is not required           +          -        +
Cross OS support                           +          +        -
Native Threads Win32 threads – CreateThread POSIX threads boost::threads – platform independent ACE threads – platform independent, optimized for Concurrent and Network programming (used by many defense and telecom companies: Boeing, Raytheon, Elbit, Ericsson, Motorola, Lucent, Siemens, etc.)
Win32 Threads API It is assumed that the reader/listener is familiar with basic multithreading and the corresponding Win32 API
What are Win32 Threads? Microsoft Windows implementation C language interface to library Follows Win32 programming model Threads exist within single process All threads are peers No explicit parent-child model
Win32 API Hierarchy for Concurrency (diagram): Windows OS contains Jobs; Jobs group Processes; each Process has a Primary Thread plus additional Threads; Threads can schedule Fibers.
Process Each process provides the resources needed to execute a program. A process has: a virtual address space, executable code, open handles to system objects, a security context, a unique process identifier, environment variables, a priority class, minimum and maximum working set sizes, and at least one thread of execution. Each process is started with a single thread, often called the primary thread, but can create additional threads from any of its threads.
Thread A thread is the entity within a process that can be scheduled for execution. All threads of a process share its virtual address space and system resources. In addition, each thread maintains: exception handlers, scheduling priority, thread local storage (TLS), a unique thread identifier, and a set of structures the system will use to save the thread context until it is scheduled. The thread context includes the thread's set of machine registers, the kernel stack, a thread environment block, and a user stack in the address space of the thread's process. Threads can also have their own security context, which can be used for impersonating clients.
Job object A  job object  allows groups of processes to be managed as a unit. Job objects are namable, securable, sharable objects that control attributes of the processes associated with them. Operations performed on the job object affect all processes associated with the job object. Job = Process Group There is no Thread Group API in Win32 (unlike ACE C++ library or Java)
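A small example of the job-object API mentioned above (error handling kept minimal; the job name "DemoJob" is just an illustration):

// Create a named job object and add the current process to it (Win32, C).
#include <windows.h>
#include <stdio.h>

int main(void) {
    HANDLE hJob = CreateJobObject(NULL, TEXT("DemoJob"));   // named, shareable job object
    if (hJob == NULL) {
        printf("CreateJobObject failed: %lu\n", GetLastError());
        return 1;
    }
    if (!AssignProcessToJobObject(hJob, GetCurrentProcess()))
        printf("AssignProcessToJobObject failed: %lu\n", GetLastError());
    else
        printf("Current process is now part of the job.\n");

    CloseHandle(hJob);
    return 0;
}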
Fiber / Coroutine / Microthread A fiber is a unit of execution that must be manually scheduled by the application. Fibers run in the context of the threads that schedule them. Each thread can schedule multiple fibers. Fibers do not run simultaneously, so there is no need to lock shared data. In general, fibers do not provide advantages over a well-designed multithreaded application. However, using fibers can make it easier to port applications that were designed to schedule their own threads.
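A short sketch of the fiber API (ConvertThreadToFiber / CreateFiber / SwitchToFiber); scheduling here is purely cooperative, which is exactly the point of the slide:

// The main fiber and a worker fiber explicitly yield to each other.
#include <windows.h>
#include <stdio.h>

static LPVOID g_mainFiber;

VOID CALLBACK WorkerFiber(LPVOID param) {
    (void)param;
    for (int i = 0; i < 3; ++i) {
        printf("worker fiber: step %d\n", i);
        SwitchToFiber(g_mainFiber);          // yield back to the main fiber
    }
    SwitchToFiber(g_mainFiber);              // never simply return from a fiber routine
}

int main(void) {
    g_mainFiber = ConvertThreadToFiber(NULL);            // current thread becomes a fiber
    LPVOID worker = CreateFiber(0, WorkerFiber, NULL);   // default stack size

    for (int i = 0; i < 3; ++i) {
        printf("main fiber: scheduling worker\n");
        SwitchToFiber(worker);               // run the worker until it yields
    }

    DeleteFiber(worker);
    ConvertFiberToThread();
    return 0;
}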
Win32 Threads API Example: CreateThread
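The original slide's code sample is not reproduced in this transcript; the following is a minimal stand-in showing the usual CreateThread / WaitForSingleObject pattern:

// Create a worker thread, wait for it to finish, and clean up the handle.
#include <windows.h>
#include <stdio.h>

DWORD WINAPI ThreadProc(LPVOID param) {
    int id = *(int*)param;
    printf("hello from thread %d (tid=%lu)\n", id, GetCurrentThreadId());
    return 0;
}

int main(void) {
    int arg = 42;
    DWORD tid;
    HANDLE hThread = CreateThread(NULL,        // default security attributes
                                  0,           // default stack size
                                  ThreadProc,  // thread function
                                  &arg,        // argument passed to the thread
                                  0,           // run immediately
                                  &tid);
    if (hThread == NULL) {
        printf("CreateThread failed: %lu\n", GetLastError());
        return 1;
    }
    WaitForSingleObject(hThread, INFINITE);    // "join" the thread
    CloseHandle(hThread);
    return 0;
}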
Win32 Threads API: Critical Section
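Again, the slide's code is missing from the transcript; a minimal sketch of protecting shared state with a CRITICAL_SECTION looks like this:

// Two threads increment a shared counter under a critical section.
#include <windows.h>
#include <stdio.h>

static CRITICAL_SECTION g_cs;
static long g_counter = 0;

DWORD WINAPI Worker(LPVOID param) {
    (void)param;
    for (int i = 0; i < 100000; ++i) {
        EnterCriticalSection(&g_cs);   // lock
        ++g_counter;                   // the shared, mutable state
        LeaveCriticalSection(&g_cs);   // unlock
    }
    return 0;
}

int main(void) {
    InitializeCriticalSection(&g_cs);

    HANDLE threads[2];
    threads[0] = CreateThread(NULL, 0, Worker, NULL, 0, NULL);
    threads[1] = CreateThread(NULL, 0, Worker, NULL, 0, NULL);
    WaitForMultipleObjects(2, threads, TRUE, INFINITE);

    printf("counter = %ld (expected 200000)\n", g_counter);
    CloseHandle(threads[0]);
    CloseHandle(threads[1]);
    DeleteCriticalSection(&g_cs);
    return 0;
}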
CCriticalSection C++ Class
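The original CCriticalSection code is not in the transcript; a plausible sketch of such a wrapper class (the name follows the slide title) is:

// Thin C++ wrapper around a Win32 CRITICAL_SECTION.
#include <windows.h>

class CCriticalSection {
public:
    CCriticalSection()  { InitializeCriticalSection(&m_cs); }
    ~CCriticalSection() { DeleteCriticalSection(&m_cs); }
    void Lock()   { EnterCriticalSection(&m_cs); }
    void Unlock() { LeaveCriticalSection(&m_cs); }
private:
    CCriticalSection(const CCriticalSection&);            // non-copyable
    CCriticalSection& operator=(const CCriticalSection&);
    CRITICAL_SECTION m_cs;
};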
CCriticalSection & CLock example
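Likewise, a plausible CLock (scoped lock) and usage sketch for the class above; acquiring in the constructor and releasing in the destructor guarantees the critical section is released even on early return or exception:

// Scoped lock guard plus a small usage example.
#include <windows.h>
#include <stdio.h>

class CLock {
public:
    explicit CLock(CCriticalSection& cs) : m_cs(cs) { m_cs.Lock(); }
    ~CLock() { m_cs.Unlock(); }
private:
    CLock(const CLock&);                 // non-copyable
    CLock& operator=(const CLock&);
    CCriticalSection& m_cs;
};

static CCriticalSection g_section;       // wrapper class from the previous sketch
static long g_total = 0;

DWORD WINAPI Worker(LPVOID) {
    for (int i = 0; i < 100000; ++i) {
        CLock guard(g_section);          // locked for the rest of this scope
        ++g_total;
    }                                    // unlocked here, in ~CLock()
    return 0;
}

int main() {
    HANDLE t[2];
    t[0] = CreateThread(NULL, 0, Worker, NULL, 0, NULL);
    t[1] = CreateThread(NULL, 0, Worker, NULL, 0, NULL);
    WaitForMultipleObjects(2, t, TRUE, INFINITE);
    printf("total = %ld\n", g_total);
    CloseHandle(t[0]);
    CloseHandle(t[1]);
    return 0;
}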
Thread Affinity Thread affinity  forces a thread to run on a specific subset of processors. Use the  SetProcessAffinityMask  function to specify thread affinity for all threads of the process.  To set the thread affinity for a single thread, use the  SetThreadAffinityMask  function. The thread affinity must be a subset of the process affinity.  You can obtain the current process affinity by calling the  GetProcessAffinityMask  function. Setting thread affinity should generally be avoided, because it can interfere with the scheduler's ability to schedule threads effectively across processors. This can decrease the performance gains produced by parallel processing.
Thread Ideal Processor When you specify a  thread ideal processor , the scheduler runs the thread on the specified processor when possible.  Use the  SetThreadIdealProcessor  function to specify a preferred processor for a thread.  This does not guarantee that the ideal processor will be chosen, but provides a useful hint to the scheduler.
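A short example of the affinity and ideal-processor calls discussed above (assumes a machine with at least two logical CPUs; the Worker function stands in for, e.g., the transmitting thread from the earlier surveillance example):

// Pin the current thread to CPU #0 and hint that another thread prefers CPU #1.
#include <windows.h>
#include <stdio.h>

DWORD WINAPI Worker(LPVOID param) {
    (void)param;
    /* ... e.g. the transmitting thread's work ... */
    return 0;
}

int main(void) {
    DWORD_PTR processMask = 0, systemMask = 0;
    GetProcessAffinityMask(GetCurrentProcess(), &processMask, &systemMask);
    printf("process mask=0x%llx, system mask=0x%llx\n",
           (unsigned long long)processMask, (unsigned long long)systemMask);

    // Bind the current (e.g. receiving) thread to CPU #0.
    if (SetThreadAffinityMask(GetCurrentThread(), 0x1) == 0)
        printf("SetThreadAffinityMask failed: %lu\n", GetLastError());

    // Start the worker and hint that it should run on CPU #1.
    HANDLE hWorker = CreateThread(NULL, 0, Worker, NULL, 0, NULL);
    SetThreadIdealProcessor(hWorker, 1);       // soft hint to the scheduler
    // SetThreadAffinityMask(hWorker, 0x2);    // hard binding, if really required

    WaitForSingleObject(hWorker, INFINITE);
    CloseHandle(hWorker);
    return 0;
}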
Our running Example: The PI program – Numerical Integration
Mathematically, we know that: ∫₀¹ 4.0/(1+x²) dx = π
We can approximate the integral as a sum of rectangles: π ≈ Σ (i = 0..N) F(xᵢ)·Δx, where each rectangle has width Δx and height F(xᵢ) at the middle of interval i, with F(x) = 4.0/(1+x²).
(The slide also plots F(x) on [0, 1], with the y-axis running from 0.0 to 4.0.)
PI: Matlab N=1000000;  Step = 1/N;  PI = Step*sum(4./(1+(((1:N)-0.5)*Step).^2)); Implicit Data Parallel Implemented using SIMD Later version of Matlab can utilize multiple / multi-core CPUs
PI Program: an example

static long num_steps = 100000;
double step;

void main ()
{
    int i;
    double x, pi, sum = 0.0;
    step = 1.0/(double) num_steps;
    for (i = 1; i <= num_steps; i++) {
        x = (i - 0.5) * step;
        sum = sum + 4.0/(1.0 + x*x);
    }
    pi = step * sum;
}

OpenMP PI Program: Parallel for with a reduction

#include <omp.h>
static long num_steps = 100000;
double step;
#define NUM_THREADS 2

void main ()
{
    int i;
    double x, pi, sum = 0.0;
    step = 1.0/(double) num_steps;
    omp_set_num_threads(NUM_THREADS);
#pragma omp parallel for reduction(+:sum) private(x)
    for (i = 1; i <= num_steps; i++) {
        x = (i - 0.5) * step;
        sum = sum + 4.0/(1.0 + x*x);
    }
    pi = step * sum;
}

OpenMP adds 2 to 4 lines of code
OpenMP
OpenMP Slides See external PowerPoint presentation: OpenMP Tutorial Part 1: The Core Elements of OpenMP
OpenMP Compiler Option How to enable OpenMP in MSVC++ 2005: Project -> Properties -> Configuration Properties -> C/C++ -> Command Line: add the /openmp switch
OpenMP Compiler Option How to enable OpenMP in Intel Compiler: /Qopenmp /Qopenmp_report{0|1|2}
Auto-Parallelization Intel compiler only, using compiler switches: /Qparallel /Qpar_report[n] Automatic threading of loops without having to manually insert OpenMP directives. The compiler can identify "easy" candidates for parallelization Large applications are difficult to analyze
Intel® TBB Thread Building Blocks C++ Library
Calculate PI using TBB
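The slide's source code is not included in this transcript; the following is a sketch of the usual way to compute PI with the classic tbb::parallel_reduce body-object API (the explicit task_scheduler_init is only needed in early TBB releases):

// PI with Intel TBB parallel_reduce: each sub-range accumulates a partial
// sum in operator(), and join() combines the partial sums.
#include <cstdio>
#include "tbb/blocked_range.h"
#include "tbb/parallel_reduce.h"
#include "tbb/task_scheduler_init.h"

static const long num_steps = 100000;

struct PiSum {
    double sum;
    double step;
    PiSum(double s) : sum(0.0), step(s) {}
    PiSum(PiSum& other, tbb::split) : sum(0.0), step(other.step) {}   // splitting constructor

    void operator()(const tbb::blocked_range<long>& r) {
        double local = sum;
        for (long i = r.begin(); i != r.end(); ++i) {
            double x = (i + 0.5) * step;
            local += 4.0 / (1.0 + x * x);
        }
        sum = local;
    }
    void join(const PiSum& rhs) { sum += rhs.sum; }
};

int main() {
    tbb::task_scheduler_init init;                 // required only in early TBB versions
    double step = 1.0 / (double)num_steps;

    PiSum body(step);
    tbb::parallel_reduce(tbb::blocked_range<long>(0, num_steps), body);

    double pi = step * body.sum;
    std::printf("pi ~= %.10f\n", pi);
    return 0;
}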
Calculate PI in MPI
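The MPI version is likewise missing from the transcript; here is a minimal sketch (the MPI_Reduce call is described on the next slide):

// PI with MPI: each rank sums a strided subset of the steps, then
// MPI_Reduce combines the partial sums on rank 0.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char* argv[]) {
    const long num_steps = 100000;
    int rank, size;
    double step, x, local = 0.0, pi = 0.0;
    long i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    step = 1.0 / (double)num_steps;
    for (i = rank; i < num_steps; i += size) {      // cyclic distribution of iterations
        x = (i + 0.5) * step;
        local += 4.0 / (1.0 + x * x);
    }
    local *= step;

    MPI_Reduce(&local, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("pi ~= %.10f (computed on %d processes)\n", pi, size);

    MPI_Finalize();
    return 0;
}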
MPI_Reduce
Reduces values on all processes to a single value on the root process.
Synopsis:

int MPI_Reduce(
    void         *sendbuf,    // address of send buffer
    void         *recvbuf,    // address of receive buffer (significant only at root)
    int           count,      // number of elements in send buffer
    MPI_Datatype  datatype,   // data type of elements of send buffer
    MPI_Op        op,         // reduce operation: sum, prod, min, max, etc.
    int           root,       // rank of root process
    MPI_Comm      comm        // communicator (handle)
);
Summary Multi-core performance improvements do not come for free if your application is single-threaded (although multiple concurrent applications will all run faster together). Use thread CPU Affinity to improve your application's performance and Real-Time deterministic execution. Use OpenMP to parallelize your existing serial C/C++ code – a minimal-investment, incremental way into threading. Use the Intel® TBB library for your new C++ projects – a high-level, STL-like parallel API. For I/O-bounded Real-Time and/or Network Concurrent Programming, look at the ACE Framework or use your native threads API. Be aware of potential performance pitfalls like cache locality and FSB saturation. Benchmark everything!
Commercial Multi-core Commercial multi-core applications are not the future. They are NOW!
Any Questions?
Thank you!
