The document discusses parallel programming approaches for multicore processors, advocating for using Haskell and embracing diverse approaches like task parallelism with explicit threads, semi-implicit parallelism by evaluating pure functions in parallel, and data parallelism. It argues that functional programming is well-suited for parallel programming due to its avoidance of side effects and mutable state, but that different problems require different solutions and no single approach is a silver bullet.
2. The free lunch is over. Multicores are here. We have
to program them. This is hard. Yada-yada-yada.
Programming parallel computers
Plan A. Start with a language whose computational fabric is
by-default sequential, and by heroic means make the program
parallel
Plan B. Start with a language whose computational fabric is
by-default parallel
Every successful large-scale application of parallelism
has been largely declarative and value-oriented
SQL Server
LINQ
Map/Reduce
Scientific computation
Plan B will win. Parallel programming will increasingly
mean functional programming
3. Parallel functional programming was tried in
the 80’s, and basically failed to deliver
Then: Uniprocessors were getting faster really, really quickly.
Now:  Uniprocessors are stalled.

Then: Our compilers were naive, so constant factors were bad.
Now:  Compilers are pretty good.

Then: The parallel guys were a dedicated band of super-talented programmers who would burn any number of cycles to make their supercomputer smoke.
Now:  They are regular Joe Developers.

Then: Parallel computers were really expensive, so you needed 95% utilisation.
Now:  Everyone has 8, 16, 32 cores, whether they use them or not. Even using 4 of them (with little effort) would be a Jolly Good Thing.
4. Parallel functional programming was tried in
the 80’s, and basically failed to deliver
Then: We had no story about (a) locality, (b) exploiting regularity, and (c) granularity.
Now:  Lots of progress (this talk):
• Software transactional memory
• Distributed memory
• Data parallelism
• Generating code for GPUs
5. “Just use a functional language and your
troubles are over”
Right idea:
Limited side effects (rather than literally no side effects)
Strong guarantees that sub-computations do not interfere
But far too starry eyed. No silver bullet:
one size does not fit all
need to “think parallel”: if the algorithm has
sequential data dependencies, no language will
save you!
6. Different problems need different solutions:
Shared memory vs distributed memory
Transactional memory
Message passing
Data parallelism
Locality
Granularity
Map/reduce
...on and on and on...

Common theme:
the cost model matters – you can’t just say “leave it to the system”
no single cost model is right for all

A “cost model” gives the programmer some idea of what an operation costs, without burying her in details. Examples:
• Send message: copy data or swing a pointer?
• Memory fetch: uniform access or do cache effects dominate?
• Thread spawn: tens of cycles or tens of thousands of cycles?
• Scheduling: can a thread starve?
7. Goal: express the “natural structure” of a program
involving lots of concurrent I/O (e.g. a web server, a
responsive GUI, or downloading lots of URLs in parallel)
Makes perfect sense with or without multicore
Most threads are blocked most of the time
Usually done with
Thread pools
Event handler
Message pumps
Really, really hard to get right, especially when combined with
exceptions and error handling
NB: Significant steps forward in F#/C# recently: Async<T> See
http://channel9.msdn.com/blogs/pdc2008/tl11
8. Sole goal: performance using multiple cores
…at the cost of a more complicated program
#include “StdTalk.io”
Clock speeds not increasing
Transistor count still increasing
Delivered in the form of more cores
Often with an inadequate memory bandwidth
No alternative: the only way to ride Moore’s
law is to write parallel code
9. Use a functional language
But offer many different approaches to
parallel/concurrent programming, each with
a different cost model
Do not force an up-front choice:
Better one language offering many abstractions
…than many languages offering one each
(HPF, map/reduce, pthreads…)
10. This talk
Multicore: use Haskell! Lots of different concurrent/parallel programming paradigms (cost models) in Haskell.

Task parallelism: explicit threads, synchronised via locks, messages, or STM.
  Modest parallelism. Hard to program.
Semi-implicit parallelism: evaluate pure functions in parallel.
  Modest parallelism. Implicit synchronisation. Easy to program.
Data parallelism: operate simultaneously on bulk data.
  Massive parallelism. Easy to program. Single flow of control. Implicit synchronisation.
Slogan: no silver bullet: embrace diversity
11. Multicore: parallel programming is essential.
Task parallelism: explicit threads, synchronised via locks, messages, or STM.
12. Lots of threads, all performing I/O
GUIs
Web servers (and other servers of course)
BitTorrent clients
Non-deterministic by design
Needs
Lightweight threads
A mechanism for threads to coordinate/share
Typically: pthreads/Java threads +
locks/condition variables
13. Very very lightweight threads
Explicitly spawned, can perform I/O
Threads cost a few hundred bytes each
You can have (literally) millions of them
I/O blocking via epoll => OK to have hundreds of
thousands of outstanding I/O requests
Pre-emptively scheduled
Threads share memory
Coordination via Software Transactional
Memory (STM)
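As a rough illustration of that scale (not from the slides; it uses the STM operations that appear later in the talk), here is a sketch that spawns a hundred thousand lightweight threads and waits for them all:

import Control.Concurrent (forkIO)
import Control.Concurrent.STM
import Control.Monad (replicateM_)

main :: IO ()
main = do
  let n = 100000
  done <- newTVarIO (0 :: Int)
  -- each thread just bumps a shared counter and exits
  replicateM_ n $ forkIO $ atomically $ modifyTVar' done (+1)
  -- block until every thread has run
  atomically $ readTVar done >>= check . (== n)
  putStrLn ("all " ++ show n ++ " threads finished")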
14. main = do { putStr (reverse "yes")
              ; putStr "no" }
• Effects are explicit in the type system
– (reverse "yes") :: String -- No effects
– (putStr "no") :: IO () -- Can have effects
• The main program is an effect-ful
computation
– main :: IO ()
15. newRef :: a -> IO (Ref a)
readRef :: Ref a -> IO a
writeRef :: Ref a -> a -> IO ()

main = do { r <- newRef 0       -- Reads and writes
          ; incR r              -- are 100% explicit!
          ; s <- readRef r
          ; print s }

incR :: Ref Int -> IO ()
incR r = do { v <- readRef r    -- You can’t say (r + 6),
            ; writeRef r (v+1)  -- because r :: Ref Int
            }
16. forkIO :: IO () -> IO ThreadId
forkIO spawns a thread
It takes an action as its argument
webServer :: RequestPort -> IO ()
webServer p = do { conn <- acceptRequest p
; forkIO (serviceRequest conn)
; webServer p }
serviceRequest :: Connection -> IO ()
serviceRequest c = do { … interact with client … }
No event-loop spaghetti!
17. How do threads coordinate with each other?
main = do { r <- newRef 0
          ; forkIO (incR r)
          ; incR r              -- Aargh! A race
          ; ... }

incR :: Ref Int -> IO ()
incR r = do { v <- readRef r
            ; writeRef r (v+1) }
18. A 10-second review:
Races: due to forgotten locks
Deadlock: locks acquired in “wrong” order.
Lost wakeups: forgotten notify to condition
variable
Diabolical error recovery: need to restore
invariants and release locks in exception
handlers
These are serious problems. But even worse...
19. Scalable double-ended queue: one lock per cell
No interference if
ends “far enough”
apart
But watch out when the queue
is 0, 1, or 2 elements long!
21.
Coding style                       Difficulty of concurrent queue
Sequential code                    Undergraduate
Locks and condition variables      Publishable result at an international conference
22.
Coding style                       Difficulty of concurrent queue
Sequential code                    Undergraduate
Locks and condition variables      Publishable result at an international conference
Atomic blocks                      Undergraduate
23. atomically { ... sequential get code ... }
To a first approximation, just write the
sequential code, and wrap atomically around it
All-or-nothing semantics: Atomic commit
Atomic block executes in Isolation
(Think ACID.)
Cannot deadlock (there are no locks!)
Atomicity makes error recovery easy
(e.g. an exception thrown inside the get code)
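For concreteness, a minimal sketch of that claim (not the slides’ double-ended queue; it uses the TVar operations introduced two slides later): take the ordinary sequential pop-from-the-front code and simply wrap it in atomically.

import Control.Concurrent.STM

-- A queue held in a TVar; the body is just the sequential code,
-- and 'atomically' turns it into an all-or-nothing transaction.
popFront :: TVar [a] -> IO (Maybe a)
popFront q = atomically $ do
  xs <- readTVar q
  case xs of
    []       -> return Nothing
    (x:rest) -> do writeTVar q rest
                   return (Just x)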
24. atomically :: IO a -> IO a
main = do { r <- newRef 0
; forkIO (atomically (incR r))
; atomically (incR r)
; ... }
atomically is a function, not a syntactic
construct
A worry: what stops you doing incR
outside atomically?
25. Better idea:
atomically :: STM a -> IO a
newTVar    :: a -> STM (TVar a)
readTVar   :: TVar a -> STM a
writeTVar  :: TVar a -> a -> STM ()

incT :: TVar Int -> STM ()
incT r = do { v <- readTVar r; writeTVar r (v+1) }

main = do { r <- atomically (newTVar 0)
          ; forkIO (atomically (incT r))
          ; atomically (incT r)
          ; ... }
26. atomically :: STM a -> IO a
newTVar :: a -> STM (TVar a)
readTVar :: TVar a -> STM a
writeTVar :: TVar a -> a -> STM ()
Can’t fiddle with TVars outside atomic
block [good]
Can’t do IO inside atomic block [sad,
but also good]
No changes to the compiler
(whatsoever). Only runtime system and
primops.
...and, best of all...
27. incT :: TVar Int -> STM ()
incT r = do { v <- readTVar r; writeTVar r (v+1) }

incT2 :: TVar Int -> STM ()
incT2 r = do { incT r; incT r }

foo :: IO ()
foo = ...atomically (incT2 r)...

Composition is THE way we build big programs that work.
An STM computation is always executed atomically
(e.g. incT2). The type tells you.
Simply glue STMs together arbitrarily; then wrap with
atomically
No nested atomically. (What would it mean?)
28. MVars for efficiency in (very common)
special cases
Blocking (retry) and choice (orElse) in STM
Exceptions in STM
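A minimal sketch of blocking and choice (not from the slides; the one-place buffer is made up for illustration):

import Control.Concurrent.STM

-- A one-place buffer: taking blocks (via retry) until a value is present.
takeBuf :: TVar (Maybe a) -> STM a
takeBuf buf = do
  mv <- readTVar buf
  case mv of
    Nothing -> retry                       -- block; re-run when buf changes
    Just v  -> do writeTVar buf Nothing
                  return v

-- Choice: take from whichever of two buffers has a value.
takeEither :: TVar (Maybe a) -> TVar (Maybe a) -> STM a
takeEither b1 b2 = takeBuf b1 `orElse` takeBuf b2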
29. A very simple web server written in Haskell
full HTTP 1.0 and 1.1 support,
handles chunked transfer encoding,
uses sendfile for optimized static file serving,
allows request bodies and response bodies to be
processed in constant space
Protection for all the basic attack vectors:
overlarge request headers and slow-loris
attacks
500 lines of Haskell (building on some amazing
libraries: bytestring, blaze-builder, iteratee)
30. A new thread for each user request
Fast, fast
[Chart: “pong” benchmark, requests per second]
31. Again, lots of threads: 400-600 is typical
Significantly bigger program: 5000 lines of Haskell – but way smaller than the competition
[Chart: lines of code – Haskell ~5,000 vs Erlang ~80,000; not shown: Vuze, 480k lines]
Performance: roughly competitive
32. Built on STM
Heavy use of combinator parsers: “reads like
the protocol specification if you squint your
eyes”
decodeMsg :: Parser Message
decodeMsg = do
m <- getWord8
case m of
0 -> return Choke
1 -> return Unchoke
2 -> return Interested
3 -> return NotInterested
4 -> Have <$> gw32
5 -> BitField <$> getRestLazyByteString
6 -> Request <$> gw32 <*> gw32 <*> gw32
7 -> Piece <$> gw32 <*> gw32 <*> getRestLazyByteString
8 -> Cancel <$> gw32 <*> gw32 <*> gw32
9 -> Port <$> (fromIntegral <$> getWord16be)
_ -> fail "Incorrect message parse"
where gw32 = fromIntegral <$> getWord32be
33. So far everything is shared memory
Distributed memory has a different cost model
Think message passing…
Think Erlang…
34. Processes share nothing; independent GC;
independent failure
Communicate over channels
Message communication = serialise to
bytestream, transmit, deserialise
Comprehensive failure model
A process P can “link to” another Q
If Q crashes, P gets a message
Use this to build process monitoring apparatus
Key to Erlang’s 5-9’s reliability
35. Provide Erlang as a library – no language
extensions needed
newChan :: PM (SPort a, RPort a)
send :: Serialisable a => SPort a -> a -> PM a
receive :: Serialisable a => RPort a -> PM a
spawn :: NodeId -> PM a -> PM PId
[Diagram: processes connected by channels; each process may contain many Haskell threads, which share via STM]
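A hedged sketch of how these primitives compose, written against the API above (PM, SPort, RPort, NodeId, PId, spawn, send and receive are the slide’s library design; echoLoop and startEcho are made up for illustration, so this is a sketch rather than code for any shipping package):

-- Echo process: receive a message on its receive port, send it back, repeat.
echoLoop :: Serialisable a => RPort a -> SPort a -> PM ()
echoLoop inP outP = do
  x <- receive inP       -- deserialise an incoming message
  _ <- send outP x       -- serialise and transmit the reply
  echoLoop inP outP

-- Spawn the echo process on a (possibly remote) node.
startEcho :: Serialisable a => NodeId -> RPort a -> SPort a -> PM PId
startEcho node inP outP = spawn node (echoLoop inP outP)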
36. Many static guarantees for cost model:
(SPort a) is serialisable, but not (RPort a)
=> you always know where to send your message
(TVar a) not serialisable
=> no danger of multi-site STM
37. The k-means clustering algorithm takes a set of data points
and groups them into clusters by spatial proximity.
[Figure: clusters with random initial centroids, then after the first, second, and third iterations, until converged]
●Start with Z lots of data points in N-dimensional space
●Randomly choose k points as ”centroid candidates”
●Repeat:
1. For each data point, find the nearest ”centroid candidate” (sketched in code below)
2. For each candidate C, find the centroid of all points nearest to C
3. Make those the new centroid candidates, and repeat
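A minimal sketch of steps 1 and 2 in plain Haskell (illustrative only; Point, sqDist, nearest and centroid are made-up names, not the EC2 implementation):

import Data.List (minimumBy)
import Data.Ord  (comparing)

type Point = [Double]

-- Squared Euclidean distance between two points.
sqDist :: Point -> Point -> Double
sqDist p q = sum [ (a - b) * (a - b) | (a, b) <- zip p q ]

-- Step 1: index of the nearest centroid candidate for a point.
nearest :: [Point] -> Point -> Int
nearest candidates p =
  fst (minimumBy (comparing (sqDist p . snd)) (zip [0..] candidates))

-- Step 2: centroid (coordinate-wise mean) of a non-empty group of points.
centroid :: [Point] -> Point
centroid ps = map (/ fromIntegral (length ps)) (foldr1 (zipWith (+)) ps)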
38. ●Start with Z lots of data points in N-dimensional space
●Randomly choose k points as ”centroid candidates”
●Repeat:
1. For each data point, find the nearest ”centroid candidate”
2. For each candidate C, find the centroid of all points nearest to C
3. Make those the new centroid candidates, and repeat if necessary
[Diagram: the MapReduce structure – a Master farms Step 1 out to Mappers 1..n, Reducers 1..k perform Step 2, then Step 3 checks “converged?” and either repeats or returns the Result]
Running today in Haskell on an Amazon EC2 cluster [current work]
41. Side effects are inconvenient
do { v <- readTVar r; writeTVar r (v+1) }
vs
r++
Result: almost all the code is functional,
processing immutable data
Great for avoiding bugs: no aliasing, no race
hazards, no cache ping-ponging.
Great for efficiency: only TVar accesses are
tracked by STM
42. Multicore: use Haskell!
Semi-implicit parallelism: evaluate pure functions in parallel.
Modest parallelism. Implicit synchronisation. Easy to program.
Slogan: no silver bullet: embrace diversity
43. Place n queens on an n x n board
such that no queen attacks any
other, horizontally, vertically, or
diagonally
Sequential code
nqueens :: Int -> [[Int]]
nqueens n = subtree n []
subtree :: Int -> [Int] -> [[Int]]
subtree 0 b = [b]
subtree c b = concat $
map (subtree (c-1)) (children b)
children :: [Int] -> [[Int]]
children b = [ (q:b) | q <- [1..n],
safe q b ]
44. Place n queens on an n x n board
such that no queen attacks any
other, horizontally, vertically, or
diagonally
[Figure: the search tree. Start at the empty board []; its children include [1], [2], ...; the children of [1] include [1,1], [2,1], [3,1], [4,1], ...; the children of [3,1] are [1,3,1], [2,3,1], [3,3,1], [4,3,1], [5,3,1], [6,3,1]; and so on]
45. Place n queens on an n x n board
such that no queen attacks any
other, horizontally, vertically, or
diagonally
Sequential code
nqueens :: Int -> [[Int]]
nqueens n = subtree n []
subtree :: Int -> [Int] -> [[Int]]
subtree 0 b = [b]
subtree c b = concat $
map (subtree (c-1)) (children b)
children :: [Int] -> [[Int]]
children b = [ (q:b) | q <- [1..n],
safe q b ]
46. Place n queens on an n x n board
such that no queen attacks any
other, horizontally, vertically, or
diagonally
Parallel code: works on the sub-trees in parallel

nqueens :: Int -> [[Int]]
nqueens n = subtree n []

subtree :: Int -> [Int] -> [[Int]]
subtree 0 b = [b]
subtree c b = concat $
  parMap (subtree (c-1)) (children b)

children :: [Int] -> [[Int]]
children b = [ (q:b) | q <- [1..n],
               safe q b ]
  -- 'n' and 'safe' come from the enclosing program (elided on the slide)
Speedup: 3.5x on 6 cores
47. map :: (a->b) -> [a] -> [b]
parMap :: (a->b) -> [a] -> [b]
Good things
Parallel program guaranteed not to change
the result
Deterministic: same result every run
Very low barrier to entry
“Strategies” to separate algorithm from
parallel structure
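One plausible way to define such a parMap (a sketch; GHC’s library version lives in Control.Parallel.Strategies and takes an explicit Strategy argument): spark the head element while evaluating the rest of the list.

import Control.Parallel (par, pseq)

myParMap :: (a -> b) -> [a] -> [b]
myParMap _ []     = []
myParMap f (x:xs) = y `par` (ys `pseq` (y : ys))
  where
    y  = f x           -- sparked: may be evaluated on another core
    ys = myParMap f xs -- evaluated by this thread, in parallel with y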
48. Bad things
Poor cost model; all too easy to fail to
evaluate something and lose all parallelism
Not much locality; shared memory
Over-fine granularity can be a big issue
Profiling tools can help a lot
50. Find authentication or secrecy failures in
cryptographic protocols. (Famous example: authentication
failure in the Needham-Schroeder public key protocol. )
About 6,500 lines of Haskell
“I think it would be moronic to code CPSA in C or Python. The
algorithm is very complicated, and the leap between the
documented design and the Haskell code is about as small as one
can get, because the design is functional.”
One call to parMap
Speedup of 3x on a quad-core --- worthwhile when
many problems take 24 hrs to run.
51. Modest but worthwhile speedups (3-10) for
very modest investment
Limited to shared memory; 10’s not 1000’s of
processors
You still have to think about a parallel
algorithm! (E.g. John Ramsdell had to
refactor his CPSA algorithm a bit.)
52. Multicore: use Haskell!
Data parallelism: operate simultaneously on bulk data.
Massive parallelism. Easy to program. Single flow of control. Implicit synchronisation.
Slogan: no silver bullet: embrace diversity
53. Data parallelism
The key to using multicores at scale
Flat data parallel: apply a sequential operation to bulk data. Very widely used.
Nested data parallel: apply a parallel operation to bulk data. Research project.
54. e.g. Fortran(s), *C
MPI, map/reduce
The brand leader: widely used, well
understood, well supported
foreach i in 1..N {
...do something to A[i]...
}
BUT: “something” is sequential
Single point of concurrency
Easy to implement:
use “chunking”
Good cost model (both granularity and locality)
[Diagram: 1,000,000’s of (small) work items, chunked across processors P1, P2, P3]
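A hedged Haskell sketch of that chunking idea, using the parallel package’s Control.Parallel.Strategies (chunkedMap and the chunk size are illustrative; the flat data parallel systems above do this division for you):

import Control.DeepSeq (NFData)
import Control.Parallel.Strategies (parListChunk, rdeepseq, withStrategy)

-- Apply a sequential 'f' to every element, evaluating the result list
-- in parallel chunks of the given size.
chunkedMap :: NFData b => Int -> (a -> b) -> [a] -> [b]
chunkedMap chunkSize f xs =
  withStrategy (parListChunk chunkSize rdeepseq) (map f xs)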
55. Faces are compared by computing a distance between their multi-region histograms:

  dist(A,B) = (1/R) · Σ_{r=1..R} ‖ h^A_r − h^B_r ‖₁

[Figure: the multi-region histogram for candidate face A, stored as an array, regions r = 1..4]
56. [Diagram: the dataflow for dist(A,B) – replicate histogram A across the candidates, zipWith (-) against the B histograms, reduce to an L1 norm within each region, reduce over the R regions, then map the division by R]
57. [Diagram repeated from the previous slide, now alongside the code:]
distances :: Array DIM2 Float -> Array DIM3 Float
          -> Array DIM1 Float
distances histA histBs = dists
  where
    histAs = replicate (constant (All, All, f)) histA
    diffs  = zipWith (-) histAs histBs
    l1norm = reduce (\a b -> abs a + abs b) 0 diffs
    regSum = reduce (+) 0 l1norm
    dists  = map (/ r) regSum
    (h, r, f) = shape histBs
58. Arrays as values: virtually no element-wise
programming (for loops).
Think APL, but with much more
polymorphism
Performance is (currently) significantly less than C – BUT it auto-parallelises.
Warning: take all such figures with buckets of salt
59. GPUs are massively parallel processors, and
are rapidly de-specialising from graphics
Idea: your program (when run) generates a
GPU program
distances :: Acc (Array DIM2 Float)
          -> Acc (Array DIM3 Float)
          -> Acc (Array DIM1 Float)
distances histA histBs = dists
  where
    histAs = replicate (constant (All, All, f)) histA
    diffs  = zipWith (-) histAs histBs
    l1norm = reduce (\a b -> abs a + abs b) 0 diffs
    regSum = reduce (+) 0 l1norm
    dists  = map (/ r) regSum
60. An (Acc a) is a syntax tree for a program
computing a value of type a, ready to be
compiled for GPU
The key trick: (+) :: Num a => a -> a -> a
distances :: Acc (Array DIM2 Float)
          -> Acc (Array DIM3 Float)
          -> Acc (Array DIM1 Float)
distances histA histBs = dists
  where
    histAs = replicate (constant (All, All, f)) histA
    diffs  = zipWith (-) histAs histBs
    l1norm = reduce (\a b -> abs a + abs b) 0 diffs
    regSum = reduce (+) 0 l1norm
    dists  = map (/ r) regSum
61. An (Acc a) is a syntax tree for a program
computing a value of type a, ready to be
compiled for GPU
CUDA.run :: Acc (Array a b) -> Array a b
CUDA.run
takes the syntax tree
compiles it to CUDA
loads the CUDA into GPU
marshals input arrays into GPU memory
runs it
marshals the result array back into Haskell memory
62. The code for Repa (multicore) and
Accelerate (GPU) is virtually identical
Only the types change
Other research projects with similar
approach
Nikola (Harvard)
Obsidian/Feldspar (Chalmers)
Accelerator (Microsoft .NET)
Recursive islands (MSR/Columbia)
63. Data parallelism
The key to using multicores at scale
Nested data parallel
Apply parallel
operation to bulk data
Research project
64. Main idea: allow “something” to be parallel
foreach i in 1..N {
...do something to A[i]...
}
Now the parallelism
structure is recursive,
and unbalanced
Much more expressive
Much harder to implement
Still 1,000,000’s of (small) work items
65. Nested data parallel program (the one we want to write)
    --Compiler--> Flat data parallel program (the one we want to run)
Invented by Guy Blelloch in the 1990s
We are now working on embodying it in GHC:
Data Parallel Haskell
Turns out to be jolly difficult in practice
(but if it was easy it wouldn’t be research).
Watch this space.
66. No single cost model suits all programs /
computers. It’s a complicated world. Get used
to it.
For concurrent programming, functional
programming is already a huge win
For parallel programming at scale, we’re going to
end up with data parallel functional programming
Haskell is super-great because it hosts multiple
paradigms. Many cool kids hacking in this space.
But other functional programming languages are
great too: Erlang, Scala, F#