
02 - Introduction to Concurrent Systems


Introduction to Concurrent and Parallel Programming

◼ Single vs. Multi-tasking Systems
◼ Applications of Concurrency
◼ Parallel Programming
◼ Methods of Implementing Multi-tasking

©Paul Davies. Not to be copied, used, or revised without explicit written permission from the copyright owner.
Single and Multi-tasking Real Time systems

◼ In the last lecture we introduced Real Time systems and talked about
two different classifications of such systems

◼ Hard vs. Soft
◼ Event-driven vs. Time-driven

◼ Irrespective of those two classifications, a further classification of real-time
systems exists, as shown below.

◼ Single tasking systems
◼ Multi-tasking (or concurrent) systems
Question: What do we mean by a Single Tasking Real-Time System?
Answer: A system that runs a single sequential thread of code. That is, the
same kind of programs you have written up to now e.g. in APSC 160, CPSC
259 etc.

Sequential 'C' Code   Assembly Code   Machine Code

#include <stdio.h>

int main(void)
{
    for (int i = 0; i < 10; ++i) {
        printf("%d\n", i);
    }
    return 0;
}

(The two right-hand columns of the figure showed the corresponding x86-64
assembly and machine-code bytes produced by the compiler for this loop.)

Single Thread: Instructions executed one after the other.
What is a Multi-Tasking (Concurrent) System?

◼ A multi-tasking, or concurrent, system involves running several parts of a
program at the same time. There are several ways to achieve this:

◼ Multiple executable programs running in parallel (i.e. as processes).
◼ Multiple sections of a program running concurrently (i.e. as threads).

◼ This can be achieved physically using:

◼ A Single CPU (with 1 or more cores)
◼ Multiple CPUs (with 1 or more cores)
◼ Distributed Systems (i.e. a network of CPUs)

◼ (see http://en.wikipedia.org/wiki/Concurrency_(computer_science) )
What do we mean by CPUs and Cores?

(Figures: a processor board with 4 CPUs, each with multiple cores (2018); an 8-core
i7 single-chip CPU with shared cache and memory controller.)
Example Multi-Tasking Using Threads: sections of a program executed in
parallel (i.e. via threads) running on a single multi-core CPU.

Program comprising multiple threads

#include "rt.h"

UINT __stdcall ChildThread1( void *args )
{
    for (int i = 0; i < 1000; i++)
        printf( "Hello From Thread 1\n" ) ;
    return 0 ;
}

UINT __stdcall ChildThread2( void *args )
{
    for (int i = 0; i < 1000; i++)
        printf( "Hello From Thread 2\n" ) ;
    return 0 ;
}

void main(void)
{
    // Create Threads using CThread objects
    CThread t1( ChildThread1, ACTIVE, NULL ) ;
    CThread t2( ChildThread2, ACTIVE, NULL ) ;

    // wait for Threads to terminate
    t1.WaitForThread() ;
    t2.WaitForThread() ;
}

(Figure: the main thread plus the two child threads running in parallel.)

How many threads? What cores/CPUs do they run on?
Multi-Tasking: Pros and Cons

Advantages:
• Utilises CPU power - has the potential to harness all available cores; software
performance grows with the number of cores and CPUs in the system.
• Flexibility – system can be distributed across several servers, perhaps in different
countries
• Scalability – a single executable can be run multiple times as separate tasks.

Challenges:
• Decomposing/architecting system into appropriate concurrent tasks
• Communication between parallel tasks
• Synchronization between parallel tasks
• Debugging and testing (especially difficult)
Example Multi-Tasking in a Distributed System
◼ Multi-tasking is used extensively in web servers, where the same program code to
handle a connection from a single remote client is run multiple times as a thread on the
server, one for each new client, leading to highly scalable solutions (a minimal sketch
follows after this list).
◼ Denial of service attacks attempt to crash the server by making hundreds of thousands
of connections per second and overwhelming the CPU/Memory resources of the server.
(see http://en.wikipedia.org/wiki/Denial-of-service_attack ).
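To make the thread-per-client idea concrete, here is a minimal sketch using standard C++
threads rather than the course's rt.h/CThread library; the connection handling is simulated
with plain integer client IDs (real accept()/socket code is platform specific), and the
function names are invented purely for illustration.

#include <cstdio>
#include <thread>
#include <vector>

// Hypothetical per-client handler: in a real server this would read the
// request from a socket and send back a response.
void HandleClient(int clientId)
{
    printf("Serving client %d\n", clientId);
}

int main()
{
    std::vector<std::thread> workers;

    // Simulate three clients connecting; a real server would loop on accept()
    // and spawn (or reuse) a thread for each new connection.
    for (int clientId = 1; clientId <= 3; ++clientId)
        workers.emplace_back(HandleClient, clientId);

    // All threads run the same code, but each has its own stack and variables.
    for (auto &w : workers)
        w.join();

    return 0;
}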

(Figure: client machines anywhere in the world connect over the Internet to server(s)
anywhere in the world. Three server threads, instantiated from the same code, handle 3
concurrent clients; they share the same code loaded into memory, but each has its own
storage for variables/stack etc.)
Another Example of Scalability
◼ An elevator system comprising 4 elevators.

◼ Instead of writing one big program to control all 4 elevators, we could just
write a single task to control 1 elevator and run 4 copies of the executable,
with an elevator scheduler to handle and delegate floor requests to each (a
minimal sketch follows after the figure below).

◼ Designing our software architecture around parallel tasks leads to more
scalable, flexible solutions.

(Figure: Solution 1, a big monolithic single-tasking system - one main() handling all
elevator requests - vs. Solution 2, a multi-tasking system in which an elevator scheduler
receives floor requests and issues commands to four copies of the same elevator task, one
main() per elevator, Elevator 1 to Elevator 4.)
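As a rough sketch of Solution 2's shape (not the course's actual design): the same
ElevatorTask code is instantiated four times, here as standard C++ threads for simplicity
(the slide suggests four copies of one executable, i.e. processes, would work equally well),
with main() standing in for the scheduler; the round-robin delegation of requests and all
names are assumptions made purely for illustration.

#include <cstdio>
#include <thread>
#include <vector>

// Hypothetical single-elevator controller: the same code is instantiated once
// per elevator, each instance working only on the requests delegated to it.
void ElevatorTask(int elevatorId, std::vector<int> floorRequests)
{
    for (int floor : floorRequests)
        printf("Elevator %d: moving to floor %d\n", elevatorId, floor);
}

int main()
{
    // The "scheduler": delegate incoming floor requests to the 4 elevators.
    // A fixed round-robin split is used here purely for illustration.
    std::vector<int> requests = {3, 7, 1, 9, 5, 2, 8, 6};
    std::vector<std::vector<int>> perElevator(4);
    for (std::size_t i = 0; i < requests.size(); ++i)
        perElevator[i % 4].push_back(requests[i]);

    // Run 4 copies of the same elevator task, one thread per elevator.
    std::vector<std::thread> elevators;
    for (int id = 0; id < 4; ++id)
        elevators.emplace_back(ElevatorTask, id + 1, perElevator[id]);

    for (auto &e : elevators)
        e.join();

    return 0;
}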
Parallel Programming

◼ Parallel programming is a specialised branch of concurrent programming
which focuses on the design of algorithms intended to make software run
faster in a multiple-CPU/multiple-core environment.

(Note: Concurrent programming is not generally about designing faster
algorithms or solutions; it's more often about designing scalable solutions,
with the option of having several things happen at the same time.)

◼ Example: a multi-threaded version of the Quick-Sort algorithm leverages the
power of multiple cores to make the sorting faster (a minimal sketch follows below).
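A minimal sketch of such a multi-threaded Quick-Sort, assuming standard C++
(std::async, std::partition) rather than any particular course library; the depth limit,
pivot choice and sequential fallback are illustrative assumptions, not the definitive
implementation.

#include <algorithm>
#include <cstdio>
#include <future>
#include <vector>

// Depth-limited parallel quicksort: at each level the two partitions are
// sorted concurrently, until 'depth' reaches 0, after which the code falls
// back to plain std::sort for the remaining sub-range.
void ParallelQuickSort(std::vector<int> &v, int lo, int hi, int depth)
{
    if (hi - lo < 2)
        return;
    if (depth <= 0) {
        std::sort(v.begin() + lo, v.begin() + hi);   // sequential fallback
        return;
    }
    int pivot = v[lo + (hi - lo) / 2];
    auto m1 = std::partition(v.begin() + lo, v.begin() + hi,
                             [pivot](int x) { return x < pivot; });
    auto m2 = std::partition(m1, v.begin() + hi,
                             [pivot](int x) { return x == pivot; });
    int left  = static_cast<int>(m1 - v.begin());
    int right = static_cast<int>(m2 - v.begin());

    // Sort the 'less than' part in a separate task while this thread sorts
    // the 'greater than' part; the two halves have no data dependency.
    auto task = std::async(std::launch::async,
                           [&] { ParallelQuickSort(v, lo, left, depth - 1); });
    ParallelQuickSort(v, right, hi, depth - 1);
    task.wait();
}

int main()
{
    std::vector<int> data = {9, 3, 7, 1, 8, 2, 6, 5, 4, 0};
    ParallelQuickSort(data, 0, static_cast<int>(data.size()), 2);
    for (int x : data) printf("%d ", x);
    printf("\n");
    return 0;
}

The depth limit keeps the number of tasks proportional to the number of cores; spawning a
new task for every tiny sub-range would cost more than it saves.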
Parallel Programming (cont…)

◼ As an example, consider a program designed to repeatedly evaluate the
following mathematical expression:

X = B² – 4AC

◼ We could easily write the solution for this using 1 line of C/C++ code.

◼ However, the resulting 1-line solution will run no faster on a quad-core CPU
than it would on a single-core CPU.

◼ This is because neither the compiler, the CPU, nor the operating system has the
ability to automatically partition the problem into several smaller tasks that
can be executed in parallel.
◼ However, a programmer could theoretically architect the previous expression
into 3 much smaller tasks (implemented as threads on multiple cores):

◼ The 1st thread could calculate B²
◼ The 2nd thread could calculate 4AC

Note: Because neither of the above 2 expressions depends upon the outcome of the
other, these two threads could be executed in parallel (i.e. at the same time).
In fact an operating system could even allocate these new tasks to run on separate
CPUs and/or cores, at the same time.

◼ Finally, we could write a 3rd thread to perform the '-' (subtraction), i.e. take the
output of the previous two threads and subtract them (a minimal sketch follows below).
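A minimal sketch of this three-task decomposition, assuming standard C++ futures to stand
in for the three threads; the variable values and names are made up purely for illustration.

#include <cstdio>
#include <future>

// Sketch of the three-task decomposition of X = B^2 - 4AC.
// Tasks 1 and 2 have no data dependency and can run in parallel; "task 3"
// (the subtraction) must wait for both of their results.
int main()
{
    double A = 2.0, B = 5.0, C = 1.0;

    // Task 1: B^2, Task 2: 4AC - launched concurrently.
    std::future<double> bSquared = std::async(std::launch::async,
                                              [=] { return B * B; });
    std::future<double> fourAC   = std::async(std::launch::async,
                                              [=] { return 4.0 * A * C; });

    // "Task 3": the subtraction. get() blocks until each producer finishes,
    // which is exactly the synchronisation discussed on the next slide.
    double X = bSquared.get() - fourAC.get();

    printf("X = %f\n", X);   // 5*5 - 4*2*1 = 17
    return 0;
}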
Problems: Data Dependencies, Communication and Synchronisation
◼ Designing this 3rd thread however highlights the difficulties of parallel
programming.

◼ That 3rd thread is required to communicate with and synchronise itself to the
output of the two previous threads.

◼ Generally speaking, splitting a system into smaller parallel tasks introduces
data dependencies, where one action is dependent upon the output produced
by another, such as the subtraction earlier.

◼ The impact of this is that one thread must be designed to wait for the other
two threads to complete. This in turn limits the amount of parallelism in the
solution and thus limits the speed at which it can be calculated. Of course
pipelining the data and results might help.

(Note: research in the field of parallel programming languages is attempting to create
compilers that can automatically parallelise sections of code for execution on multiple
cores, which should give a worthwhile boost to the performance of existing 'legacy'
programs.)
Approaches to Parallel Programming

◼ Any fool can take an existing sequential algorithm/program, compile it and
run it on a multi-core computer. The trouble is, it won't run any faster, and all
those extra cores will sit there idle.

◼ Designing an algorithm specifically to take advantage of multiple cores for fast
processing is very challenging and requires a dedicated (i.e. parallel
programming) course in itself.

◼ For example, you may have to throw away or modify some of those “classic”
algorithms that appear in text books and courses like CPSC 260/259 because
they have evolved to be the fastest solution for sequential execution on a
SINGLE core.
Parallel Programming Approaches

◼ Optimising an algorithm for parallel operation on multiple cores may mean that
the data has to be reorganised (i.e. perhaps repartitioned) so that several cores
can work on their own data sets independently of each other, reducing the number
of data dependencies.

◼ It’s very common and frustrating to find that after spending many hours
attempting to create a parallel version of a sequential algorithm, it often
performs worse than the sequential solution.
◼ Problems that lend themselves well to parallel processing include:

◼ MPEG/JPEG decoding,
◼ Image processing,
◼ Weather forecasting,
◼ Finite element analysis, etc.

◼ This is because it is possible to split the data processed by these types of
task into smaller, independent chunks for processing on separate cores.
The more cores you have, the faster the processing (in theory). (A minimal
sketch of this chunk-splitting idea follows after this list.)

◼ Adobe Photoshop has been written to be multi-threaded and will show a
significant performance increase when multiple cores exist.

◼ If an algorithm contains too many data dependencies, then little benefit
will be seen from a parallel multi-core solution.
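The chunk-splitting sketch referred to above: an "image" is divided into independent chunks,
one per thread, assuming standard C++ threads; the brighten-by-10 operation, the image size
and the function names are placeholder assumptions purely for illustration.

#include <cstddef>
#include <cstdio>
#include <thread>
#include <vector>

// Each worker brightens its own, independent chunk of the "image"; because
// the chunks do not overlap, there are no data dependencies between threads.
void BrightenChunk(std::vector<int> &pixels, std::size_t begin, std::size_t end)
{
    for (std::size_t i = begin; i < end; ++i)
        pixels[i] += 10;
}

int main()
{
    std::vector<int> pixels(1000000, 100);          // a fake 1-megapixel image
    unsigned numThreads = std::thread::hardware_concurrency();
    if (numThreads == 0) numThreads = 4;            // fallback if unknown

    std::vector<std::thread> workers;
    std::size_t chunk = pixels.size() / numThreads;
    for (unsigned t = 0; t < numThreads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = (t == numThreads - 1) ? pixels.size() : begin + chunk;
        workers.emplace_back(BrightenChunk, std::ref(pixels), begin, end);
    }
    for (auto &w : workers)
        w.join();

    printf("pixel[0] = %d\n", pixels[0]);           // 110
    return 0;
}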
◼ Not all problems lend themselves to a parallel solution. For example, a
program to calculate the Fibonacci series

(1, 1, 2, 3, 5, 8, 13, 21, ...)

is inherently a sequential problem and will never lend itself to a faster
parallel solution.

◼ This is because the formula for calculating the next element in the series is
given by the equation

F(n + 2) = F(n + 1) + F(n)

◼ That is, each element in the series can only be calculated after the previous
two, i.e. sequentially; it is thus not possible to calculate the elements of the
series in parallel. For a good overview of parallel programming visit
https://computing.llnl.gov/tutorials/parallel_comp
Speedup and Parallel Programming

◼ Speedup is a simple measure of the performance increase of a parallel
algorithm running on multiple cores vs. the equivalent sequential version
running on a single core.

◼ Speedup is defined by the following simple formula:

Sp = T1 / Tp

◼ p is the number of available cores.
◼ T1 is the execution time of the sequential algorithm with 1 CPU/core.
◼ Tp is the execution time of the parallel algorithm with 'p' CPUs/cores.

Example: a sequential Quick-Sort takes 10 s; the parallel Quick-Sort with p = 4
takes 7 s, so the speedup is Sp = 10 / 7 ≈ 1.4x.
Decoding time for an MPEG sequence

(Figure: bar chart of speedup (0 to 7) for decoding an MPEG sequence, plotted against
the number of CPUs (1, 2, 4) and the number of cores per CPU (1, 2, 4).)

Questions:
◼ Why is speedup not linear, i.e. when you double the number of cores, why does speedup
not appear to double?
◼ Why might doubling the number of CPUs not always yield the same speedup as doubling
the number of cores?
◼ Can speedup ever be < 1 ?
Implementing Multi-tasking (concurrency)

◼ Multi-tasking systems can be realised in a variety of different ways:

◼ Pseudo Multi-Tasking - (i.e. also known as 'faking' it)
◼ Multiple Cores/CPUs - (true multi-tasking)
◼ Distributed CPUs - (multiple CPUs communicating over a network)
◼ Time-Sliced Core/CPU - (compromise, used by most systems)
Approach 1: Pseudo Multi-tasking Systems

◼ Here, a cheap form of ‘fake’ multitasking is implemented by the system through clever
programming designed to make it appear as if the system is executing several tasks
repeatedly and in parallel. There’s no attempt to make anything faster.

◼ With a carefully crafted program decomposed into smaller tasks, each of which is brief
and can be executed over and over again, inside a loop, we can design a system that
gives the illusion of concurrency.

◼ The program below demonstrates this concept using a loop built into a program.

◼ Tasks are simulated via functions called repetitively within the loop.

void main(void)
{
    while(1) {
        Monitor_Temperature() ;
        Control_flow_rate() ;
        Update_Display() ;
    }
}

Tasks are 'simulated' with function calls which are invoked rapidly and
repeatedly to create the illusion they are all running at the same time.
The infinite loop creates the illusion that all tasks run concurrently.
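For reference, a compilable version of the same super-loop sketch, with stub task bodies
invented purely so the example runs; the loop is bounded only so that the sketch terminates,
where a real system would use while(1) as above.

#include <cstdio>

// Stub "tasks": each must return quickly so the loop can reach the others.
void Monitor_Temperature() { printf("checking temperature\n"); }
void Control_flow_rate()   { printf("adjusting flow rate\n"); }
void Update_Display()      { printf("refreshing display\n"); }

int main()
{
    // The super-loop: each pass calls every task once, so as long as each
    // task is brief, all of them appear to run "at the same time".
    for (int pass = 0; pass < 3; ++pass) {
        Monitor_Temperature();
        Control_flow_rate();
        Update_Display();
    }
    return 0;
}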
Advantages of Pseudo-Multitasking Systems

◼ Simple learning curve, just a bunch of simple functions inside a forever loop.

◼ No complex operating system involved.

◼ No new theories required. What we learned in APSC 160 will suffice.

◼ Inter-Task Communication : Can be handled using global variables.

◼ Task Synchronisation : Automatic; only one task is actually running at any
point in time, therefore all tasks are automatically synchronised to the
completion of all other tasks each time around the loop.
Disadvantages of Pseudo Multi-tasking Systems
◼ Each task has to be written so that it can be called repeatedly within the
system and it has to be brief.

◼ An individual task (i.e. function) must not be written to

◼ Sit in its own internal loop consuming large amounts of CPU time.
◼ Get involved in some operation that would cause it to delay its return,
such as waiting for input from a keyboard.

◼ If either of the 2 points above is violated, then other tasks/functions in the
main loop will be prevented from executing and the illusion of concurrency
will be lost, so this approach's use is limited.

◼ Problems with task protection. Any “task” that crashes may wipe out all the
other tasks.

◼ No easy way to assign priorities to tasks.


Approach 2: Multiple CPUs on a Shared Backplane
◼ Multiple processor boards are plugged into a backplane such as the VME bus,
each with its own CPU dedicated to a single process/activity.

◼ Very high performance, but expensive.

◼ Useful if the number of tasks is fixed and known at design time. May not easily
accommodate dynamic process creation, as a change in the number of processes may
require a change in the number of CPU cards. (It might be possible to address
this with an operating system inside each processor board.)

◼ CPUs/tasks communicate via shared memory boards plugged into the backplane.

◼ Not very flexible, and vulnerable to power supply failure.
Approach 3: Distributed Systems

(Figure: four nodes, A, B, C and D, connected by a network.)
◼ With a distributed system each task still has its own dedicated networked
CPU.

◼ A Real time system might require the use of a deterministic network based
on the concept of token passing or prioritised arbitration. Here a node can
transmit only when it has a token or has been given permission and can only
send one packet of information after which it has to release the token for
use by any other node. It can carry on transmitting only when it gets the
token/permission again.

◼ Here the delay in transmitting an arbitrarily sized message from one node to
another can be calculated, and is related to the following (a rough worked
sketch follows this list):

◼ Speed of the network, e.g. Mbps.
◼ Size of a packet in bytes.
◼ Size of the message being transmitted in bytes.
◼ Number of nodes on the network.
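The slides do not give a formula, but as a rough, hedged sketch of how those four factors
might combine in a worst case (my own simplifying assumptions: every other node sends one
full packet per token rotation, and token-handling overhead is ignored):

#include <cstdio>

// Rough, illustrative worst-case estimate of the time to send a message on a
// token-passing network. All numbers below are made-up example values.
int main()
{
    double linkMbps     = 100.0;     // speed of the network
    double packetBytes  = 1500.0;    // size of one packet
    double messageBytes = 1.0e6;     // size of the message being sent
    int    nodes        = 10;        // number of nodes on the network

    double packetTime    = (packetBytes * 8.0) / (linkMbps * 1.0e6);  // seconds
    double packetsNeeded = messageBytes / packetBytes;                // ~667

    // Between each of our packets, up to (nodes - 1) other nodes may each
    // transmit one packet before the token returns to us.
    double worstCaseSeconds = packetsNeeded * nodes * packetTime;

    printf("Worst-case delay: about %.2f seconds\n", worstCaseSeconds);  // ~0.80
    return 0;
}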
◼ Example distributed (embedded) system
Classifications of Multiple CPU Architectures - Heterogeneous
◼ Here a system is composed of multiple dissimilar CPUs
◼ CPUs can be chosen on the basis of suitability for the task, e.g. DSPs for signal processing,
graphics processors for image processing, Intel/AMD CPUs for general workhorse functions, and
FPGAs/ASICs for dedicated hardware acceleration.
◼ System software is complex and targets multiple CPU architectures.
◼ The interconnect of the CPUs is also highly specific, requiring dedicated solutions.
◼ Difficult to integrate to run under a single controlling operating system environment.
◼ Not very flexible, but performance can be exceptional, perhaps even optimal.
◼ (see http://en.wikipedia.org/wiki/GPGPU for development in graphics chip
programming)

(Figure: control processors (C) with shared memory (M), connected over a specialised
interconnect (backplane, networks etc.) to DSPs, FPGAs, ASICs and graphics accelerators,
each with their own memory.)

Classifications of Multiple CPU Architectures - Homogeneous

◼ The system is composed of multiple identical CPUs/Cores, e.g. dual/quad-core
processors from Intel/AMD.

◼ Sometimes referred to as symmetric multi-processing.

◼ Advantages/Drawbacks
◼ Any CPU/core can run any task. For example, one CPU/core could start a
task, but have it completed by another.

◼ CPUs are easier to integrate under the control of one single host operating
system, e.g. Windows, which controls the multiple CPUs and cores inside your
laptop/desktop computer.

◼ The OS can carry out load balancing, distributing work among the available
CPUs/cores.

◼ Easy to add more CPUs for greater performance, i.e. scalable.

◼ Resulting architecture is not always optimal for a given problem.


Clustering

◼ Clustering is useful when there is a need to process large volumes of data in
parallel, e.g.

◼ Weather forecasting,
◼ Web serving, where 1000's of clients connect to a server every second.

◼ "Beowulf" (for Linux) and Windows Server provide services to seamlessly
connect several networked computers and make them appear as one big server,
with the potential for massive parallel processing.

◼ http://en.wikipedia.org/wiki/Computer_cluster
◼ http://en.wikipedia.org/wiki/Grid_computing
◼ http://en.wikipedia.org/wiki/Blade_server
◼ https://en.wikipedia.org/wiki/Beowulf_cluster
Racks containing “Blade servers”,
complete slide-in computers with
disk drives all running under 1
operating system
Clustering used in Data Centers

◼ Clusters are often arranged into massive data centers to handle tens of thousands of
simultaneous user connections. The image below is Microsoft's data center in San
Antonio, Texas, which would typically house >100,000 blade servers.
◼ Microsoft acknowledges that it has over 1 million servers around the world (fewer than
Google, more than Amazon).
