Migration to  Multi-Core Zvi Avraham, CTO [email_address]
Agenda Multi-Core and Many-Core Hardware slides (another PowerPoint) Parallel Programming Models CPU Affinity Parallel Programming APIs Win32 Threads API OpenMP Tutorial (another PowerPoint) Intel TBB – Thread Building Blocks (if time permits) Demos – C++ source code samples Summary Questions
Multi-Core vs. Many-Core Multi-Core:  ≤8 cores/threads Many-Core: >8 cores/threads
Multi-Core x86 AMD Athlon X2 AMD Opteron Dual-Core AMD Barcelona Quad-Core AMD Phenom Triple-Core Pentium D Intel Core 2 Duo Intel Core 2 Quad Dual Core Xeon Quad Core Xeon
Many-Core Processors (MPU) the replacement for DSP and FPGA Sun UltraSPARC T2 / Niagara – 8 cores x 8 threads (the fastest commodity processor today) Sun UltraSPARC RK / Rock – 16 cores x 2 threads RMI XLR™ 732 Processor – 8 cores x 4 threads (MIPS) Cavium OCTEON – 16 cores (MIPS) TILERA TILE64 – 64 cores (MIPS)
Intel Tera-scale 80 cores ~ 1.81 Teraflops, 2 GB RAM on chip Currently only a prototype (expected in 2009/2010)
Cell Processor SONY Playstation 3 1 main PowerPC core x 2 threads @ 3 GHz + 7 "Synergistic Processing Elements" @ 3 GHz Total – 9 threads
Xenon CPU Microsoft XBOX 360 PowerPC – 3 Cores x 2 Threads @ 3.2 GHz
NVIDIA GeForce 8800GTX 128 Stream Processors @ 1.3 GHz, ~520 GFLOPS
Intel Larrabee GPU Up to 48 cores x 4 threads Cores based on Pentium (x86-64) 512 bit SIMD
Hardware Slides See external PowerPoint presentation: Maximizing Desktop Application Performance on Dual-Core PC Platforms
Parallel Programming Models
Concurrent/Parallel/Distributed Concurrency – property of systems in which several computational processes are executing at the same time, and potentially interacting with each other Parallel – computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently ("in parallel") Distributed – the tasks run on different interconnected computers
Levels of HW Parallelism Bit-level (e.g. from 8-bit to 16-bit to 32-bit to 64-bit CPUs) Instruction-level (the CPU schedules multiple instructions simultaneously) SIMD SMT – Simultaneous Multi-Threading CMP – Core Multi-Processing SMP – Symmetric Multi-Processing Cluster – computers with fast interconnect Grid – network of loosely connected computers
Flynn's Taxonomy

                  Single Instruction    Multiple Instruction
Single Data       SISD                  MISD
Multiple Data     SIMD                  MIMD
Flynn’s Taxonomy (2)
Parallel Programming Models How can we write programs that  run faster  on a multi-core CPU?  How can we write programs that  do not crash  on a multi-core CPU?  Choose the right model! There are two fundamental parallel programming models: Shared State  (Shared Memory) Message Passing  (Distributed Memory) DSM  (Distributed Shared Memory) – academic
Shared State Shared state concurrency involves the idea of “mutable state” (literally memory that can be changed). This is fine as long as you have only one process/thread doing the changing. If you have multiple processes sharing and modifying the same memory, you have a recipe for disaster -  madness lies here. To protect against the simultaneous modification of shared memory, we use a locking mechanism. Call this a mutex, critical section, synchronized method, or whatever you like, but it’s still a lock.
Message Passing In message passing concurrency, there is no shared state. All computations are done inside processes/threads, and the only way to exchange data is through asynchronous message passing. Why is this good? No need for locks. No locks – no deadlocks (almost) No locks – deterministic execution – good for Real Time. No locks – good for I/O-bounded applications – Efficient Network Programming.
Message Passing (cont.) ActiveObject design pattern ActiveObject is an Object with its own thread of control and an attached message queue. On arrival of a message, the ActiveObject wakes up and executes the command in the message. Optionally, it sends the result back to the caller (via callback or Future). Rational Rose Capsule design pattern The same as ActiveObject, but with an attached FSM. Messages are called "Events". Events change the FSM state according to the state transition table. MPI (Message Passing Interface) - de-facto standard API for Message Passing between nodes in computational clusters.
Active Object Design Pattern
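The original slide shows the ActiveObject structure as a diagram; the sketch below is a minimal C++11 illustration of the same idea (the class and method names here are illustrative, not taken from the slides):

// Minimal ActiveObject sketch: one private worker thread drains a message queue.
// Callers never touch the object's state directly; they only post messages.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

class ActiveObject {
public:
    ActiveObject() : done_(false), worker_([this] { run(); }) {}
    ~ActiveObject() {
        post([this] { done_ = true; });   // "poison" message stops the worker
        worker_.join();                   // messages posted after it are discarded
    }
    // The only way in: enqueue a command for the private thread.
    void post(std::function<void()> msg) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(msg));
        }
        cv_.notify_one();
    }
private:
    void run() {
        while (!done_) {
            std::function<void()> msg;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return !queue_.empty(); });
                msg = std::move(queue_.front());
                queue_.pop();
            }
            msg();   // execute the command; results go back via callback or future
        }
    }
    bool done_;
    std::queue<std::function<void()>> queue_;
    std::mutex mutex_;
    std::condition_variable cv_;
    std::thread worker_;
};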
Rational’s Capsule ROOM – Real-Time Object Oriented Modeling
Implicit vs. Explicit Parallelism Implicit Parallelism Automatic parallelization of code by a compiler or library A pure implicitly parallel language does not need special directives, operators or functions to enable parallel execution. The programmer stays focused on the core task, instead of worrying about dividing work into tasks and communication Lower degree of control for the programmer Less effective Examples: Matlab, R, LabView, NESL, ZPL, Intel Ct library for C++
Implicit vs. Explicit Parallelism Explicit Parallelism Representation of concurrent computations by means of primitives in the form of special-purpose directives or function calls (this extra code is also called "parallelization overhead") Most parallel primitives are related to process synchronization, communication or task partitioning Full control by the programmer Examples: thread APIs, Erlang, Ada, Cilk, etc.
Data Parallel vs. Task Parallel Data Parallelism a form of parallelization of computing across multiple processors in parallel computing environments. Data parallelism focuses on distributing the data across different parallel computing nodes.  Task Parallelism a form of parallelization of computer code across multiple processors in parallel computing environments. Task parallelism focuses on distributing execution processes (threads) across different parallel computing nodes. Most real programs fall somewhere on a continuum between Task parallelism and Data parallelism.
Data Parallel example

if CPU = "a" then
    low_limit = 1
    upper_limit = 50
else if CPU = "b" then
    low_limit = 51
    upper_limit = 100
end if
for i = low_limit to upper_limit
    Task on d(i)
end for

Task Parallel example

program:
    ...
    if CPU = "a" then
        do task "A"
    else if CPU = "b" then
        do task "B"
    end if
    ...
end program
"Plain" vs Nested Data Parallelism "Plain" Data Parallelism Good for regular data (vectors, matrices) No support for control flow Efficient impl. on SIMD or GPGPU HW Nested Data Parallelism Supports irregular or sparse data Can express control flow Efficient impl. on multi-core CPUs with SIMD Examples: [a:5, b:7] + [b:3, c:3, d:1] = [a:5, b:10, c:3, d:1] sum( [a:5, a:3, b:2, b:4, d:1] ) = [a:8, b:6, d:1]
Parallel Models Data Parallel models: Work Sharing (using Fork / Join) Parallel For Scatter / Gather Map / Reduce Split / Map / Combine / Reduce / Merge Task Parallel models: Fork / Join Recursive Fork / Join (Work Stealing scheduler) Scheduler / Workers (aka Master / Slave) Compute Grid:  Executer / Scheduler / Workers
Scatter / Gather
Fork / Join (diagram: a Master Thread forks Parallel Regions; barriers act as "joins")

Recursive Fork/Join

Result solve(Problem problem) {
    if (problem is small)
        directly solve problem
    else {
        split problem into independent parts
        fork new subtasks to solve each part
        join all subtasks
        compose result from subresults
    }
}
Map / Reduce Input & Output: each a set of key/value pairs  Programmer specifies two functions: map (in_key, in_value) ->  list(out_key, intermediate_value)  Processes input key/value pair  Produces set of intermediate pairs  reduce (out_key, list(intermediate_value)) ->  list(out_value)  Combines all intermediate values for a particular key  Produces a set of merged output values (usually just one)
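To make the map/reduce contract above concrete, here is a minimal single-process word-count sketch in C++ (illustrative only; a real MapReduce runtime distributes the map, shuffle and reduce steps across many nodes):

// Word count with the two user-supplied functions described above: map and reduce.
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// map(in_key, in_value) -> list of (out_key, intermediate_value)
std::vector<std::pair<std::string, int>> map_fn(const std::string& line) {
    std::vector<std::pair<std::string, int>> out;
    std::istringstream words(line);
    std::string w;
    while (words >> w) out.emplace_back(w, 1);   // emit (word, 1)
    return out;
}

// reduce(out_key, list(intermediate_value)) -> out_value
int reduce_fn(const std::string& /*key*/, const std::vector<int>& values) {
    int sum = 0;
    for (int v : values) sum += v;
    return sum;
}

int main() {
    std::vector<std::string> input = { "to be or not to be", "to thine own self be true" };
    // Shuffle phase: group intermediate values by key.
    std::map<std::string, std::vector<int>> groups;
    for (const auto& line : input)
        for (const auto& kv : map_fn(line))
            groups[kv.first].push_back(kv.second);
    // Reduce phase: one call per distinct key.
    for (const auto& g : groups)
        std::cout << g.first << ": " << reduce_fn(g.first, g.second) << "\n";
}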
Map / Reduce
Types of Parallelism Fine-grained – tasks need to communicate many times per second; small messages, low latency Coarse-grained – tasks do not need to communicate many times per second Embarrassing parallelism – tasks are independent
Embarrassingly Parallel Problems An Embarrassingly Parallel Problem is one for which no particular effort is needed to segment the problem into a very large number of parallel tasks, and there is no essential dependency (or communication) between those parallel tasks. If each step can be computed independently of every other step, then each step can be made to run on a separate processor to achieve quicker results. Examples: running the same algorithm on each grabbed frame running the same algorithm with different parameters
Extended Flynn’s Taxonomy SPMD  -  Single Program, Multiple Data streams   Multiple autonomous processors simultaneously executing the same program (but at independent points, rather than in the  lockstep  that SIMD imposes) on different data. Also referred to as 'Single Process, multiple data' [6] . SPMD is the most common style of parallel programming [7] .  MPMD  -  Multiple Program, Multiple Data streams Multiple autonomous processors simultaneously operating at least 2 independent programs. Typically such systems pick one node to be the "host" ("the explicit host/node programming model") or "manager" (the "Manager/Worker" strategy), which runs one program that farms out data to all the other nodes which all run a second program. Those other nodes then return their results directly to the manager.
CPU Affinity Changing Process and Thread Affinity
CPU Affinity The system uses a symmetric multiprocessing (SMP) model to schedule processes and threads on multiple processors. Any process/thread can be assigned to any processor. On a single-CPU system threads can't run concurrently; they are time-sliced instead. On a multiple-CPU system threads can run concurrently on different processors. Scheduling is still determined by thread priority. However, on a multiprocessor computer, you can also affect scheduling by setting thread affinity and the thread ideal processor (i.e. "binding" a thread to a specific CPU).
Why mess with CPU Affinity? Legacy Code Migration Usually the Windows OS handles CPU affinity automatically very well. So why mess with it? When migrating old applications to a multi-core CPU, their performance may start degrading. This happens because the OS moves the application's threads and interrupt routines between cores. By fixing the application's process and threads to CPU #0, the application will behave the same way as on a uniprocessor machine.
Why mess with CPU Affinity? Real-Time and Determinism By setting CPU affinity for our application's threads we get deterministic execution times, needed for Real-Time. The OS will no longer move our threads between cores. For example, let's say we are developing a video surveillance system on a dual-core CPU. We have two threads: #1 – receiving video frames from the Matrox Frame Grabber #2 – transmitting the received frames via TCP/IP to the Server By setting the affinity of the receiving thread to CPU #0 and of the transmitting thread to CPU #1 – we get deterministic execution times.
Why mess with CPU Affinity? Memory Affinity Cache locality Some multi-core CPUs have separate per-core caches, with no cache shared between cores (for example: Pentium-D, some of Intel's quad-core CPUs). If the OS moves a thread from one core to another, its data is missing from the second core's cache and must be fetched from RAM again, so we get performance degradation. NUMA On NUMA systems, each memory bank connects directly to a single CPU; accessing RAM attached to another CPU is slower. By setting a thread's CPU affinity and allocating memory on the faster RAM bank, we get a performance improvement.
Changing CPU Affinity Using Task Manager – manually; needs to be changed every time the program is run ImageCfg.exe Utility – modifies the program's EXE image – need to do it only once Process.exe Utility – changes the program's affinity without modifying its EXE image IntFiltr.exe – Interrupt Affinity Filter tool Win32 Affinity API: SetProcessAffinityMask SetThreadAffinityMask
CPU Affinity – Task Manager
CPU Affinity – Task Manager
ImageCFG – Affinity Mask Tool Change the CPU Affinity mask of an existing EXE file WARNING: modifies the .EXE file – so back it up before you do anything "-u" - Marks the image as uniprocessor only: imagecfg -u c:\path\to\file.exe "-a" – Process Affinity mask value in hex: Permanently set the CPU Affinity mask to CPU #1 imagecfg -a 0x1 c:\path\to\file.exe Permanently set the CPU Affinity mask to CPU #2 imagecfg -a 0x2 c:\path\to\file.exe
Process.exe – Get Affinity Mask When the "-a" option is used in conjunction with a process name or PID, the utility will show the System Affinity Mask and the Process Affinity Mask.
Process.exe – Set Affinity Mask To set the affinity mask, simply append the binary mask after the PID/Image Name. Any leading zeros are ignored, so there is no requirement to enter the full 32-bit mask. Doesn't modify the EXE file – so it must be run each time
IntFiltr.exe – Interrupt Affinity Filter Binding device interrupts to particular processors on multiprocessor computers is a useful technique to maximize performance, scaling, and partitioning of large computers. The Interrupt-Affinity Filter (IntFiltr) is an interrupt-binding tool that permits you to establish affinity between device interrupts and processors on multiprocessor computers. IntFiltr uses Plug and Play features of Windows 2000 and provides a Graphical User Interface (GUI) to permit interrupt binding.
Parallel Programming APIs
Parallel Programming APIs Threads – native OS threads API (Win32 Threads, POSIX pthreads, etc.). Low-level, highly error-prone API. Locking, etc. OpenMP – SMP standard, which allows incremental parallelization of serial code by adding compiler directives, called "pragmas". Intel TBB – Thread Building Blocks C++ library from Intel. High-level constructs: think in terms of tasks instead of threads (like a "Parallel STL"). Auto-Parallelization
Simplicity / Complexity The native threads programming model introduces much more complexity within the code than OpenMP or Intel® TBB, making it more challenging to maintain. One of the benefits of using Intel® TBB or OpenMP when appropriate is that these APIs create and manage the thread pool for you:  thread synchronization and scheduling are handled automatically .
Capabilities Comparison

                                       Intel TBB   OpenMP   Threads
Task level parallelism                     +          +        -
Data decomposition support                 +          +        -
Complex parallel patterns (non-loops)      +          -        -
Generic parallel patterns                  +          -        -
Scalable nested parallelism support        +          -        -
Built-in load balancing                    +          +        -
Affinity support                           -          +        +
Static scheduling                          -          +        -
Concurrent data structures                 +          -        -
Scalable memory allocator                  +          -        -
I/O dominated tasks                        -          -        +
User-level synch. primitives               +          +        -
Compiler support is not required           +          -        +
Cross OS support                           +          +        -
Native Threads Win32 threads – CreateThread POSIX threads boost::threads – platform independent ACE threads – platform independent, optimized for Concurrent and Network programming (used by many defense and telecom companies: Boeing, Raytheon, Elbit, Ericsson, Motorola, Lucent, Siemens, etc.)
Win32 Threads API It is assumed that the reader/listener is familiar with basic multithreading and the corresponding Win32 API
What are Win32 Threads? Microsoft Windows implementation C language interface to library Follows Win32 programming model Threads exist within single process All threads are peers No explicit parent-child model
Win32 API Hierarchy for Concurrency (diagram): Windows OS contains Jobs; Jobs group Processes; each Process has a Primary Thread plus additional Threads; Threads can schedule Fibers.
Process Each process provides the resources needed to execute a program. A process has: a virtual address space, executable code, open handles to system objects, a security context, a unique process identifier, environment variables, a priority class, minimum and maximum working set sizes, and at least one thread of execution. Each process is started with a single thread, often called the primary thread, but can create additional threads from any of its threads.
Thread A thread is the entity within a process that can be scheduled for execution. All threads of a process share its virtual address space and system resources. In addition, each thread maintains: exception handlers, scheduling priority, thread local storage (TLS), a unique thread identifier, and a set of structures the system will use to save the thread context until it is scheduled. The thread context includes the thread's set of machine registers, the kernel stack, a thread environment block, and a user stack in the address space of the thread's process. Threads can also have their own security context, which can be used for impersonating clients.
Job object A  job object  allows groups of processes to be managed as a unit. Job objects are namable, securable, sharable objects that control attributes of the processes associated with them. Operations performed on the job object affect all processes associated with the job object. Job = Process Group There is no Thread Group API in Win32 (unlike ACE C++ library or Java)
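A small example of the job-object API mentioned above (error handling kept minimal; the job name "DemoJob" is just an illustration):

// Create a named job object and add the current process to it (Win32, C).
#include <windows.h>
#include <stdio.h>

int main(void) {
    HANDLE hJob = CreateJobObject(NULL, TEXT("DemoJob"));   // named, shareable job object
    if (hJob == NULL) {
        printf("CreateJobObject failed: %lu\n", GetLastError());
        return 1;
    }
    if (!AssignProcessToJobObject(hJob, GetCurrentProcess()))
        printf("AssignProcessToJobObject failed: %lu\n", GetLastError());
    else
        printf("Current process is now part of the job.\n");

    CloseHandle(hJob);
    return 0;
}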
Fiber / Coroutine / Microthread A fiber is a unit of execution that must be manually scheduled by the application. Fibers run in the context of the threads that schedule them. Each thread can schedule multiple fibers. Fibers do not run simultaneously, so there is no need to lock shared data. In general, fibers do not provide advantages over a well-designed multithreaded application. However, using fibers can make it easier to port applications that were designed to schedule their own threads.
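A short sketch of the fiber API (ConvertThreadToFiber / CreateFiber / SwitchToFiber); scheduling here is purely cooperative, which is exactly the point of the slide:

// The main fiber and a worker fiber explicitly yield to each other.
#include <windows.h>
#include <stdio.h>

static LPVOID g_mainFiber;

VOID CALLBACK WorkerFiber(LPVOID param) {
    (void)param;
    for (int i = 0; i < 3; ++i) {
        printf("worker fiber: step %d\n", i);
        SwitchToFiber(g_mainFiber);          // yield back to the main fiber
    }
    SwitchToFiber(g_mainFiber);              // never simply return from a fiber routine
}

int main(void) {
    g_mainFiber = ConvertThreadToFiber(NULL);            // current thread becomes a fiber
    LPVOID worker = CreateFiber(0, WorkerFiber, NULL);   // default stack size

    for (int i = 0; i < 3; ++i) {
        printf("main fiber: scheduling worker\n");
        SwitchToFiber(worker);               // run the worker until it yields
    }

    DeleteFiber(worker);
    ConvertFiberToThread();
    return 0;
}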
Win32 Threads API Example: CreateThread
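The original slide's code sample is not reproduced in this transcript; the following is a minimal stand-in showing the usual CreateThread / WaitForSingleObject pattern:

// Create a worker thread, wait for it to finish, and clean up the handle.
#include <windows.h>
#include <stdio.h>

DWORD WINAPI ThreadProc(LPVOID param) {
    int id = *(int*)param;
    printf("hello from thread %d (tid=%lu)\n", id, GetCurrentThreadId());
    return 0;
}

int main(void) {
    int arg = 42;
    DWORD tid;
    HANDLE hThread = CreateThread(NULL,        // default security attributes
                                  0,           // default stack size
                                  ThreadProc,  // thread function
                                  &arg,        // argument passed to the thread
                                  0,           // run immediately
                                  &tid);
    if (hThread == NULL) {
        printf("CreateThread failed: %lu\n", GetLastError());
        return 1;
    }
    WaitForSingleObject(hThread, INFINITE);    // "join" the thread
    CloseHandle(hThread);
    return 0;
}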
Win32 Threads API: Critical Section
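Again, the slide's code is missing from the transcript; a minimal sketch of protecting shared state with a CRITICAL_SECTION looks like this:

// Two threads increment a shared counter under a critical section.
#include <windows.h>
#include <stdio.h>

static CRITICAL_SECTION g_cs;
static long g_counter = 0;

DWORD WINAPI Worker(LPVOID param) {
    (void)param;
    for (int i = 0; i < 100000; ++i) {
        EnterCriticalSection(&g_cs);   // lock
        ++g_counter;                   // the shared, mutable state
        LeaveCriticalSection(&g_cs);   // unlock
    }
    return 0;
}

int main(void) {
    InitializeCriticalSection(&g_cs);

    HANDLE threads[2];
    threads[0] = CreateThread(NULL, 0, Worker, NULL, 0, NULL);
    threads[1] = CreateThread(NULL, 0, Worker, NULL, 0, NULL);
    WaitForMultipleObjects(2, threads, TRUE, INFINITE);

    printf("counter = %ld (expected 200000)\n", g_counter);
    CloseHandle(threads[0]);
    CloseHandle(threads[1]);
    DeleteCriticalSection(&g_cs);
    return 0;
}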
CCriticalSection C++ Class
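The original CCriticalSection code is not in the transcript; a plausible sketch of such a wrapper class (the name follows the slide title) is:

// Thin C++ wrapper around a Win32 CRITICAL_SECTION.
#include <windows.h>

class CCriticalSection {
public:
    CCriticalSection()  { InitializeCriticalSection(&m_cs); }
    ~CCriticalSection() { DeleteCriticalSection(&m_cs); }
    void Lock()   { EnterCriticalSection(&m_cs); }
    void Unlock() { LeaveCriticalSection(&m_cs); }
private:
    CCriticalSection(const CCriticalSection&);            // non-copyable
    CCriticalSection& operator=(const CCriticalSection&);
    CRITICAL_SECTION m_cs;
};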
CCriticalSection & CLock example
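Likewise, a plausible CLock (scoped lock) and usage sketch for the class above; acquiring in the constructor and releasing in the destructor guarantees the critical section is released even on early return or exception:

// Scoped lock guard plus a small usage example.
#include <windows.h>
#include <stdio.h>

class CLock {
public:
    explicit CLock(CCriticalSection& cs) : m_cs(cs) { m_cs.Lock(); }
    ~CLock() { m_cs.Unlock(); }
private:
    CLock(const CLock&);                 // non-copyable
    CLock& operator=(const CLock&);
    CCriticalSection& m_cs;
};

static CCriticalSection g_section;       // wrapper class from the previous sketch
static long g_total = 0;

DWORD WINAPI Worker(LPVOID) {
    for (int i = 0; i < 100000; ++i) {
        CLock guard(g_section);          // locked for the rest of this scope
        ++g_total;
    }                                    // unlocked here, in ~CLock()
    return 0;
}

int main() {
    HANDLE t[2];
    t[0] = CreateThread(NULL, 0, Worker, NULL, 0, NULL);
    t[1] = CreateThread(NULL, 0, Worker, NULL, 0, NULL);
    WaitForMultipleObjects(2, t, TRUE, INFINITE);
    printf("total = %ld\n", g_total);
    CloseHandle(t[0]);
    CloseHandle(t[1]);
    return 0;
}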
Thread Affinity Thread affinity  forces a thread to run on a specific subset of processors. Use the  SetProcessAffinityMask  function to specify thread affinity for all threads of the process.  To set the thread affinity for a single thread, use the  SetThreadAffinityMask  function. The thread affinity must be a subset of the process affinity.  You can obtain the current process affinity by calling the  GetProcessAffinityMask  function. Setting thread affinity should generally be avoided, because it can interfere with the scheduler's ability to schedule threads effectively across processors. This can decrease the performance gains produced by parallel processing.
Thread Ideal Processor When you specify a  thread ideal processor , the scheduler runs the thread on the specified processor when possible.  Use the  SetThreadIdealProcessor  function to specify a preferred processor for a thread.  This does not guarantee that the ideal processor will be chosen, but provides a useful hint to the scheduler.
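A short example of the affinity and ideal-processor calls discussed above (assumes a machine with at least two logical CPUs; the Worker function stands in for, e.g., the transmitting thread from the earlier surveillance example):

// Pin the current thread to CPU #0 and hint that another thread prefers CPU #1.
#include <windows.h>
#include <stdio.h>

DWORD WINAPI Worker(LPVOID param) {
    (void)param;
    /* ... e.g. the transmitting thread's work ... */
    return 0;
}

int main(void) {
    DWORD_PTR processMask = 0, systemMask = 0;
    GetProcessAffinityMask(GetCurrentProcess(), &processMask, &systemMask);
    printf("process mask=0x%llx, system mask=0x%llx\n",
           (unsigned long long)processMask, (unsigned long long)systemMask);

    // Bind the current (e.g. receiving) thread to CPU #0.
    if (SetThreadAffinityMask(GetCurrentThread(), 0x1) == 0)
        printf("SetThreadAffinityMask failed: %lu\n", GetLastError());

    // Start the worker and hint that it should run on CPU #1.
    HANDLE hWorker = CreateThread(NULL, 0, Worker, NULL, 0, NULL);
    SetThreadIdealProcessor(hWorker, 1);       // soft hint to the scheduler
    // SetThreadAffinityMask(hWorker, 0x2);    // hard binding, if really required

    WaitForSingleObject(hWorker, INFINITE);
    CloseHandle(hWorker);
    return 0;
}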
Our running Example: The PI program – Numerical Integration
Mathematically, we know that: ∫₀¹ 4.0/(1+x²) dx = π
We can approximate the integral as a sum of rectangles: π ≈ Σ (i = 0..N) F(xᵢ)·Δx, where each rectangle has width Δx and height F(xᵢ) at the middle of interval i, with F(x) = 4.0/(1+x²).
(The slide also plots F(x) on [0, 1], with the y-axis running from 0.0 to 4.0.)
PI: Matlab N=1000000;  Step = 1/N;  PI = Step*sum(4./(1+(((1:N)-0.5)*Step).^2)); Implicit Data Parallel Implemented using SIMD Later version of Matlab can utilize multiple / multi-core CPUs
PI Program: an example

static long num_steps = 100000;
double step;

void main ()
{
    int i;
    double x, pi, sum = 0.0;
    step = 1.0/(double) num_steps;
    for (i = 1; i <= num_steps; i++) {
        x = (i - 0.5) * step;
        sum = sum + 4.0/(1.0 + x*x);
    }
    pi = step * sum;
}

OpenMP PI Program: Parallel for with a reduction

#include <omp.h>
static long num_steps = 100000;
double step;
#define NUM_THREADS 2

void main ()
{
    int i;
    double x, pi, sum = 0.0;
    step = 1.0/(double) num_steps;
    omp_set_num_threads(NUM_THREADS);
#pragma omp parallel for reduction(+:sum) private(x)
    for (i = 1; i <= num_steps; i++) {
        x = (i - 0.5) * step;
        sum = sum + 4.0/(1.0 + x*x);
    }
    pi = step * sum;
}

OpenMP adds 2 to 4 lines of code
OpenMP
OpenMP Slides See external PowerPoint presentation: OpenMP Tutorial Part 1: The Core Elements of OpenMP
OpenMP Compiler Option How to enable OpenMP in MSVC++ 2005: Project -> Properties -> Configuration Properties -> C/C++ -> Command Line: add the /openmp switch
OpenMP Compiler Option How to enable OpenMP in Intel Compiler: /Qopenmp /Qopenmp_report{0|1|2}
Auto-Parallelization Intel compiler only, using compiler switches: /Qparallel /Qpar_report[n] Automatic threading of loops without having to manually insert OpenMP directives. The compiler can identify "easy" candidates for parallelization Large applications are difficult to analyze
Intel® TBB Thread Building Blocks C++ Library
Calculate PI using TBB
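The slide's source code is not included in this transcript; the following is a sketch of the usual way to compute PI with the classic tbb::parallel_reduce body-object API (the explicit task_scheduler_init is only needed in early TBB releases):

// PI with Intel TBB parallel_reduce: each sub-range accumulates a partial
// sum in operator(), and join() combines the partial sums.
#include <cstdio>
#include "tbb/blocked_range.h"
#include "tbb/parallel_reduce.h"
#include "tbb/task_scheduler_init.h"

static const long num_steps = 100000;

struct PiSum {
    double sum;
    double step;
    PiSum(double s) : sum(0.0), step(s) {}
    PiSum(PiSum& other, tbb::split) : sum(0.0), step(other.step) {}   // splitting constructor

    void operator()(const tbb::blocked_range<long>& r) {
        double local = sum;
        for (long i = r.begin(); i != r.end(); ++i) {
            double x = (i + 0.5) * step;
            local += 4.0 / (1.0 + x * x);
        }
        sum = local;
    }
    void join(const PiSum& rhs) { sum += rhs.sum; }
};

int main() {
    tbb::task_scheduler_init init;                 // required only in early TBB versions
    double step = 1.0 / (double)num_steps;

    PiSum body(step);
    tbb::parallel_reduce(tbb::blocked_range<long>(0, num_steps), body);

    double pi = step * body.sum;
    std::printf("pi ~= %.10f\n", pi);
    return 0;
}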
Calculate PI in MPI
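The MPI version is likewise missing from the transcript; here is a minimal sketch (the MPI_Reduce call is described on the next slide):

// PI with MPI: each rank sums a strided subset of the steps, then
// MPI_Reduce combines the partial sums on rank 0.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char* argv[]) {
    const long num_steps = 100000;
    int rank, size;
    double step, x, local = 0.0, pi = 0.0;
    long i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    step = 1.0 / (double)num_steps;
    for (i = rank; i < num_steps; i += size) {      // cyclic distribution of iterations
        x = (i + 0.5) * step;
        local += 4.0 / (1.0 + x * x);
    }
    local *= step;

    MPI_Reduce(&local, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("pi ~= %.10f (computed on %d processes)\n", pi, size);

    MPI_Finalize();
    return 0;
}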
MPI_Reduce
Reduces values on all processes to a single value on the root process.
Synopsis:

int MPI_Reduce(
    void         *sendbuf,    // address of send buffer
    void         *recvbuf,    // address of receive buffer (significant only at root)
    int           count,      // number of elements in send buffer
    MPI_Datatype  datatype,   // data type of elements of send buffer
    MPI_Op        op,         // reduce operation: sum, prod, min, max, etc.
    int           root,       // rank of root process
    MPI_Comm      comm        // communicator (handle)
);
Summary Multi-core performance improvements do not come for free if your application is single-threaded (although multiple concurrent applications will all run faster together). Use thread CPU Affinity to improve your application's performance and Real-Time deterministic execution. Use OpenMP to parallelize your existing serial C/C++ code – a minimal-investment, incremental way into threading. Use the Intel® TBB library for your new C++ projects – a high-level, STL-like parallel API. For I/O-bounded Real-Time and/or Network Concurrent Programming, look at the ACE Framework or use your native threads API. Be aware of potential performance pitfalls like cache locality and FSB saturation. Benchmark everything!
Commercial Multi-core Commercial multi-core applications are not the future. They are NOW!
Any Questions?
Thank you!
