Parallel Functional Programming in Java 8
Peter Sestoft
IT University of Copenhagen
The speaker
• MSc 1988 computer science and mathematics and
PhD 1991, DIKU, Copenhagen University
• KU, DTU, KVL and ITU; and Glasgow U, AT&T Bell
Labs, Microsoft Research UK, Harvard University
• Programming languages, software development, ...
• Open source software
– Moscow ML implementation, 1994…
– C5 Generic Collection Library, with Niels Kokholm, 2006…
– Funcalc spreadsheet implementation, 2014
Plan
• Java 8 functional programming
– Package java.util.function
– Lambda expressions, method reference expressions
– Functional interfaces, targeted function type
• Java 8 streams for bulk data
– Package java.util.stream
• High-level parallel programming
– Streams: primes, queens, van der Corput, …
– Array parallel prefix operations
• Class java.util.Arrays static methods
• A multicore performance mystery
Materials
• Java Precisely 3rd edition, MIT Press 2016
– 11.13: Lambda expressions
– 11.14: Method reference expressions
– 23: Functional interfaces
– 24: Streams for bulk data
– 25: Class Optional<T>
New in Java 8
• Lambda expressions
(String s) -> s.length()
• Method reference expressions
String::length
• Functional interfaces
Function<String,Integer>
• Streams for bulk data
Stream<Integer> is = ss.map(String::length)
• Parallel streams
is = ss.parallel().map(String::length)
• Parallel array operations
Arrays.parallelSetAll(arr, i -> sin(i/PI/100.0))
Arrays.parallelPrefix(arr, (x, y) -> x+y)
Functional programming in Java
• Immutable data instead of objects with state
• Recursion instead of loops
• Higher-order functions that either
– take functions as argument
– return functions as result
Example154.java
class FunList<T> {                      // Immutable list of T
  final Node<T> first;
  protected static class Node<U> {
    public final U item;
    public final Node<U> next;
    public Node(U item, Node<U> next) { ... }
  }
  ...
}
Immutable data
• FunList<T>, linked lists of nodes
Example154.java
class FunList<T> {
  final Node<T> first;
  protected static class Node<U> {
    public final U item;
    public final Node<U> next;
    public Node(U item, Node<U> next) { ... }
  }
  ...
}
[Figure: list1, a list of Integer with nodes 9, 13, 0; head 9, tail 13, 0]
Existing data do not change
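Example154.java
FunList<Integer> empty = new FunList<>(null),
  list1 = cons(9, cons(13, cons(0, empty))),
  list2 = cons(7, list1),
  list3 = cons(8, list1),
  list4 = list1.insert(1, 12),
  list5 = list2.removeAt(3);
[Figure: list1 has nodes 9, 13, 0. list2 and list3 each add one new node (7 and 8) in front of list1's nodes. list4 has new nodes 9, 12 followed by list1's nodes 13, 0. list5 has new nodes 7, 9, 13.]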
Recursion in insert
Example154.java
public FunList<T> insert(int i, T item) {
  return new FunList<T>(insert(i, item, this.first));
}
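The recursive helper that insert delegates to is not shown on the slide. A minimal sketch of how it could look, assuming a Node class like the one above; the class name FunListSketch, the helper body, cons and toString are illustrative and not taken from Example154.java:

```java
import java.util.StringJoiner;

// Sketch only: a cut-down immutable list, not the full Example154.java.
class FunListSketch<T> {
  final Node<T> first;

  static class Node<U> {
    final U item;
    final Node<U> next;
    Node(U item, Node<U> next) { this.item = item; this.next = next; }
  }

  FunListSketch(Node<T> first) { this.first = first; }

  static <T> FunListSketch<T> cons(T item, FunListSketch<T> list) {
    return new FunListSketch<>(new Node<>(item, list.first));
  }

  // Returns a new list; this list is left unchanged.
  FunListSketch<T> insert(int i, T item) {
    return new FunListSketch<>(insert(i, item, this.first));
  }

  // Copies the first i nodes and shares all remaining nodes with the original.
  static <T> Node<T> insert(int i, T item, Node<T> xs) {
    if (i == 0) return new Node<>(item, xs);
    if (xs == null) throw new IndexOutOfBoundsException();
    return new Node<>(xs.item, insert(i - 1, item, xs.next));
  }

  @Override public String toString() {
    StringJoiner sj = new StringJoiner(", ", "[", "]");
    for (Node<T> xs = first; xs != null; xs = xs.next)
      sj.add(String.valueOf(xs.item));
    return sj.toString();
  }

  public static void main(String[] args) {
    FunListSketch<Integer> empty = new FunListSketch<>(null);
    FunListSketch<Integer> list1 = cons(9, cons(13, cons(0, empty)));
    FunListSketch<Integer> list4 = list1.insert(1, 12);
    System.out.println(list1);  // [9, 13, 0]
    System.out.println(list4);  // [9, 12, 13, 0]
    // The nodes 13 and 0 are shared between the two lists, not copied.
    System.out.println(list4.first.next.next == list1.first.next);  // true
  }
}
```

Only the nodes in front of the insertion point are allocated anew, which is why sharing can keep the allocation cost of immutable data moderate.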
Immutable data: Bad and good
• Immutability leads to more allocation
– Takes time and space
– But modern garbage collectors are fast
• Immutable data can be safely shared
– May actually reduce amount of allocation
• Immutable data are automatically threadsafe
– No (other) thread can mess with it
– And also due to the visibility effects of the final modifier (a subtle point)
Lambda expressions 1
Example64.java
• One-argument lambda expression:
Function<String,Integer>
fsi1 = s -> Integer.parseInt(s);
Lambda expressions 2
Example64.java
• Zero-argument lambda expression:
Supplier<String>
  now = () -> new java.util.Date().toString();
• One-argument lambda expression with a statement body:
Consumer<String>
  show2 = s -> { System.out.println(">>>" + s + "<<<"); };
Method reference expressions
Example67.java
BiFunction<String,Integer,Character> charat
  = String::charAt;                     // Same as (s,i) -> s.charAt(i)
System.out.println(charat.apply("ABCDEF", 1));
Function<Integer,Character> hex1
  = "0123456789ABCDEF"::charAt;
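The two method references differ in where the receiver comes from. The following small sketch puts them side by side with an equivalent lambda; the class name MethodRefSketch and the field charatLambda are illustrative names, not part of Example67.java:

```java
import java.util.function.BiFunction;
import java.util.function.Function;

class MethodRefSketch {
  // Unbound receiver: the first argument becomes the receiver of charAt.
  static final BiFunction<String,Integer,Character> charat = String::charAt;
  // The same function written as a lambda expression.
  static final BiFunction<String,Integer,Character> charatLambda
    = (s, i) -> s.charAt(i);
  // Bound receiver: the string is fixed when the reference is evaluated.
  static final Function<Integer,Character> hex1 = "0123456789ABCDEF"::charAt;

  public static void main(String[] args) {
    System.out.println(charat.apply("ABCDEF", 1));        // B
    System.out.println(charatLambda.apply("ABCDEF", 1));  // B
    System.out.println(hex1.apply(10));                   // A
  }
}
```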
Example154.java
public <U> FunList<U> map(Function<T,U> f) {
  return new FunList<U>(map(f, first));
}
static <T,U> Node<U> map(Function<T,U> f, Node<T> xs) {
  return xs == null ? null
    : new Node<U>(f.apply(xs.item), map(f, xs.next));
}
Calling map
7 9 13
Functions as arguments: reduce
Example154.java
static <T,U> U reduce(U x0, BiFunction<U,T,U> op, Node<T> xs) {
  return xs == null ? x0
    : reduce(op.apply(x0, xs.item), op, xs.next);
}
• list.reduce(x0, op)
= x0 ⊕ x1 ⊕ ... ⊕ xn (evaluated from the left)
if we write op.apply(x,y) as x ⊕ y
• Example: list.reduce(0, (x,y) -> x+y)
= 0+x1+...+xn
Calling reduce
Example154.java
[Figure: list8 with items 17.5, 22.5, 32.5]
double sum = list8.reduce(0.0, (res, item) -> res + item);
// sum is 72.5; the corresponding product of the items is 12796.875
boolean allBig
  = list8.reduce(true, (res, item) -> res && item > 10);
// allBig is true
Tail recursion and loops
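Example154.java
static <T,U> U reduce(U x0, BiFunction<U,T,U> op, Node<T> xs) {
  return xs == null ? x0
    : reduce(op.apply(x0, xs.item), op, xs.next);   // Tail call
}
Loop version of reduce:
static <T,U> U reduce(U x0, BiFunction<U,T,U> op, Node<T> xs) {
  while (xs != null) {
    x0 = op.apply(x0, xs.item);
    xs = xs.next;
  }
  return x0;
}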
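The equivalence of the two versions can be checked on a small list. The sketch below assumes a stand-alone Node class and the names ReduceSketch, reduceRec, reduceLoop and upTo, none of which come from Example154.java. The long list is reduced only with the loop version, because the JVM does not eliminate tail calls and the recursive version would likely overflow the stack on it:

```java
import java.util.function.BiFunction;

class ReduceSketch {
  static class Node<T> {
    final T item;
    final Node<T> next;
    Node(T item, Node<T> next) { this.item = item; this.next = next; }
  }

  // Tail-recursive version: one stack frame per list node on the JVM.
  static <T,U> U reduceRec(U x0, BiFunction<U,T,U> op, Node<T> xs) {
    return xs == null ? x0
      : reduceRec(op.apply(x0, xs.item), op, xs.next);
  }

  // Loop version: the tail call is replaced by updating x0 and xs.
  static <T,U> U reduceLoop(U x0, BiFunction<U,T,U> op, Node<T> xs) {
    while (xs != null) {
      x0 = op.apply(x0, xs.item);
      xs = xs.next;
    }
    return x0;
  }

  // Builds the list 1, 2, ..., n (hypothetical helper for the demo).
  static Node<Integer> upTo(int n) {
    Node<Integer> xs = null;
    for (int i = n; i >= 1; i--)
      xs = new Node<>(i, xs);
    return xs;
  }

  public static void main(String[] args) {
    Node<Integer> small = upTo(3);
    System.out.println(reduceRec(0L, (res, item) -> res + item, small));   // 6
    System.out.println(reduceLoop(0L, (res, item) -> res + item, small));  // 6
    // One million nodes: use the loop version only.
    System.out.println(reduceLoop(0L, (res, item) -> res + item, upTo(1_000_000)));
  }
}
```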
Type of functions
interface Consumer<T> {                 // from T to void; C#: Action<T>
  void accept(T x);
}
(Too) many functional interfaces
interface IntFunction<R> {
  R apply(int x);
}
Use instead of Function<Integer,R> to avoid (un)boxing
Primitive-type specialized interfaces
for int, double, and long
interface Function<T,R> {
  R apply(T x);
}
interface IntFunction<R> {
  R apply(int x);
}
Why both? What difference?
Function<Integer,String> f1 = i -> "#" + i;
IntFunction<String> f2 = i -> "#" + i;
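One way to see the difference is to look at what each interface accepts as an argument. In this small sketch (the class name BoxingSketch is illustrative), f1 takes an Integer object, so an int argument is boxed and null is allowed, while f2 takes a plain int:

```java
import java.util.function.Function;
import java.util.function.IntFunction;

class BoxingSketch {
  static final Function<Integer,String> f1 = i -> "#" + i;
  static final IntFunction<String> f2 = i -> "#" + i;

  public static void main(String[] args) {
    // The int 42 is boxed to an Integer before f1 is called.
    System.out.println(f1.apply(42));    // #42
    // The int 42 is passed to f2 as is, with no allocation.
    System.out.println(f2.apply(42));    // #42
    // An Integer parameter may be null; an int parameter may not.
    System.out.println(f1.apply(null));  // #null
  }
}
```

The results are the same for int arguments; the specialized interface only avoids the boxing work, which matters when the function is called millions of times in a stream pipeline.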
Example158.java
static String less100(long n) {                     // Convert n < 100
  return n<20 ? ones[(int)n]
    : tens[(int)n/10-2] + after("-", ones[(int)n%10]);
}
static LongFunction<String> less(long limit, String unit,
                                 LongFunction<String> conv) {
  return n -> n<limit ? conv.apply(n)
    : conv.apply(n/limit) + " " + unit
      + after(" ", conv.apply(n%limit));
}
Functions that return functions
• Using the general higher-order function
Example158.java
static final LongFunction<String>
  less1K = less(          100, "hundred",  Example158::less100),
  less1M = less(        1_000, "thousand", less1K),
  less1B = less(    1_000_000, "million",  less1M),
  less1G = less(1_000_000_000, "billion",  less1B);
toEnglish(2147483647)
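The slides do not show the tables ones and tens, the helper after, or toEnglish itself. The following runnable sketch fills them in under stated assumptions: ones[0] is the empty string, after(d, s) prefixes s with d unless s is empty, toEnglish handles zero separately, and n is non-negative. The class name ToEnglishSketch is illustrative:

```java
import java.util.function.LongFunction;

class ToEnglishSketch {
  // Assumed tables; ones[0] is empty so that 20 gives "twenty", not "twenty-".
  static final String[] ones = { "", "one", "two", "three", "four", "five",
    "six", "seven", "eight", "nine", "ten", "eleven", "twelve", "thirteen",
    "fourteen", "fifteen", "sixteen", "seventeen", "eighteen", "nineteen" };
  static final String[] tens = { "twenty", "thirty", "forty", "fifty",
    "sixty", "seventy", "eighty", "ninety" };

  // Assumed helper: s prefixed with d, or nothing if s is empty.
  static String after(String d, String s) {
    return s.isEmpty() ? "" : d + s;
  }

  static String less100(long n) {                   // Convert n < 100
    return n<20 ? ones[(int)n]
      : tens[(int)n/10-2] + after("-", ones[(int)n%10]);
  }

  // Builds a converter for larger numbers from a converter for smaller ones.
  static LongFunction<String> less(long limit, String unit,
                                   LongFunction<String> conv) {
    return n -> n<limit ? conv.apply(n)
      : conv.apply(n/limit) + " " + unit
        + after(" ", conv.apply(n%limit));
  }

  static final LongFunction<String>
    less1K = less(          100, "hundred",  ToEnglishSketch::less100),
    less1M = less(        1_000, "thousand", less1K),
    less1B = less(    1_000_000, "million",  less1M),
    less1G = less(1_000_000_000, "billion",  less1B);

  // Assumed entry point for n >= 0.
  static String toEnglish(long n) {
    return n == 0 ? "zero" : less1G.apply(n);
  }

  public static void main(String[] args) {
    System.out.println(toEnglish(1001));        // one thousand one
    System.out.println(toEnglish(2147483647));
  }
}
```

Each call to less returns a new function that closes over limit, unit and conv, so the four converters form a chain in which every level delegates to the one below it.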
Streams for bulk data
• Stream<T> is a finite or infinite sequence of T
– Possibly lazily generated
– Possibly parallel
• Stream methods
– map, flatMap, reduce, filter, ...
– These take functions as arguments
– Can be combined into pipelines
– Java optimizes (and parallelizes) the pipelines well
• Similar to
– Java Iterators, but very different implementation
– The extension methods underlying .NET Linq
Some stream operations
• Stream<Integer> s = Stream.of(2, 3, 5)
• s.filter(p) = the x where p.test(x) holds
s.filter(x -> x%2==0) gives 2
• s.map(f) = results of f.apply(x) for x in s
s.map(x -> 3*x) gives 6, 9, 15
• s.flatMap(f) = a flattening of the streams
created by f.apply(x) for x in s
s.flatMap(x -> Stream.of(x,x+1)) gives 2,3,3,4,5,6
• s.findAny() = some element of s, if any, or else
the absent Optional<T> value
s.findAny() gives 2 or 3 or 5
• s.reduce(x0, op) = x0 ⊕ s0 ⊕ ... ⊕ sn if we write
op.apply(x,y) as x ⊕ y
s.reduce(1, (x,y)->x*y) gives 1*2*3*5 = 30
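The operations above can be tried out in a small program. Because a stream can be consumed only once, this sketch builds a fresh stream for each operation through a helper method s(); the class name StreamOpsSketch and the helper are illustrative:

```java
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class StreamOpsSketch {
  // A fresh stream of 2, 3, 5 for each use.
  static Stream<Integer> s() { return Stream.of(2, 3, 5); }

  public static void main(String[] args) {
    List<Integer> evens = s().filter(x -> x%2 == 0)
      .collect(Collectors.toList());
    List<Integer> tripled = s().map(x -> 3*x)
      .collect(Collectors.toList());
    List<Integer> flat = s().flatMap(x -> Stream.of(x, x+1))
      .collect(Collectors.toList());
    Optional<Integer> any = s().findAny();
    int product = s().reduce(1, (x, y) -> x*y);

    System.out.println(evens);            // [2]
    System.out.println(tripled);          // [6, 9, 15]
    System.out.println(flat);             // [2, 3, 3, 4, 5, 6]
    System.out.println(any.isPresent());  // true
    System.out.println(product);          // 30
  }
}
```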
Similar functions are everywhere
• Java stream map is called
– map in Haskell, Scala, F#, Clojure
– Select in C#
• Java stream flatMap is called
– concatMap in Haskell
– flatMap in Scala
– collect in F#
– SelectMany in C#
– mapcat in Clojure
• Java reduce is a special (assoc. op.) case of
– foldl in Haskell
– foldLeft in Scala
– fold in F#
– Aggregate in C#
– reduce in Clojure
Counting primes on Java 8 streams
• Our old standard Java for loop (classical efficient imperative loop):
int count = 0;
for (int i=0; i<range; i++)
  if (isPrime(i))
    count++;
• Sequential Java 8 stream (pure functional programming ...):
IntStream.range(0, range)
  .filter(i -> isPrime(i))
  .count()
This means
”catastrophic”
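The loop and the stream pipeline can be compared directly, together with a parallel variant obtained by adding .parallel(). This is a sketch: isPrime is not shown on the slides, so a simple trial-division test is assumed, and the class and method names are illustrative:

```java
import java.util.stream.IntStream;

class PrimeCountSketch {
  // Assumed primality test by trial division; not from the slides.
  static boolean isPrime(int n) {
    long k = 2;
    while (k * k <= n && n % k != 0)
      k++;
    return n >= 2 && k * k > n;
  }

  // Classical imperative loop.
  static long countLoop(int range) {
    long count = 0;
    for (int i=0; i<range; i++)
      if (isPrime(i))
        count++;
    return count;
  }

  // Sequential stream.
  static long countStream(int range) {
    return IntStream.range(0, range)
      .filter(i -> isPrime(i))
      .count();
  }

  // Parallel stream: safe because isPrime has no side effects.
  static long countParallel(int range) {
    return IntStream.range(0, range)
      .parallel()
      .filter(i -> isPrime(i))
      .count();
  }

  public static void main(String[] args) {
    int range = 100_000;
    System.out.println(countLoop(range));      // 9592
    System.out.println(countStream(range));    // 9592
    System.out.println(countParallel(range));  // 9592
  }
}
```

The three versions compute the same count; only the parallel one can use several cores, and it needs no explicit threads, locks or shared counter.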
Creating streams 1
Example164.java
• Explicitly or from array, collection or map:
IntStream is = IntStream.of(2, 3, 5, 7, 11, 13);
String[] a = { "Hoover", "Roosevelt", ...};
Stream<String> presidents = Arrays.stream(a);
Creating streams 2
Example164.java
• Useful special-case streams:
• IntStream.range(0, 10_000)
• random.ints(5_000)
• bufferedReader.lines()
• bitset.stream()
• Functional iterators for infinite streams
• Imperative generators for infinite streams
• Stream.Builder<T>: eager, only finite streams
Creating streams 3: generators
Example165.java
• Generating 0, 1, 2, 3, ... (functional):
IntStream nats1 = IntStream.iterate(0, x -> x+1);
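For comparison, the same infinite stream can be produced imperatively with IntStream.generate and a supplier that keeps mutable state. This is a sketch with illustrative names (NatsSketch, natsIterate, natsGenerate), not the code from Example165.java:

```java
import java.util.Arrays;
import java.util.stream.IntStream;

class NatsSketch {
  // Functional: each element is computed from the previous one.
  static IntStream natsIterate() {
    return IntStream.iterate(0, x -> x+1);
  }

  // Imperative: the supplier updates a one-element array that holds its state.
  static IntStream natsGenerate() {
    final int[] next = { 0 };
    return IntStream.generate(() -> next[0]++);
  }

  public static void main(String[] args) {
    // Both streams are infinite, so limit must be applied before collecting.
    System.out.println(Arrays.toString(natsIterate().limit(5).toArray()));
    System.out.println(Arrays.toString(natsGenerate().limit(5).toArray()));
    // Both lines print [0, 1, 2, 3, 4]
  }
}
```

The generator version relies on a side effect, so it should be used only in sequential pipelines; the iterate version has no such restriction on correctness.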
Creating streams 4: StreamBuilder
• Convert own linked IntList to an IntStream
Example182.java
class IntList {
  public final int item;
  public final IntList next;
  ...
  public static IntStream stream(IntList xs) {
    IntStream.Builder sb = IntStream.builder();
    while (xs != null) {
      sb.accept(xs.item);
      xs = xs.next;
    }
    return sb.build();
  }
}
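Example175.java
public static Stream<IntList> perms(BitSet todo, IntList tail) {
  if (todo.isEmpty())
    return Stream.of(tail);
  else
    return todo.stream().boxed()
      .flatMap(r -> perms(minus(todo, r), new IntList(r, tail)));
}
[Figure: the six permutations of 0, 1, 2, each drawn as a 3-by-3 grid: [2, 1, 0] [1, 2, 0] [2, 0, 1] [0, 2, 1] [0, 1, 2] [1, 0, 2]]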
Solutions to the n-queens problem
• For queens, just take diagonals into account:
– consider only r that are safe for the partial solution
Example176.java
public static Stream<IntList> queens(BitSet todo, IntList tail) {
  if (todo.isEmpty())
    return Stream.of(tail);
  else
    return todo.stream()
      .filter(r -> safe(r, tail)).boxed()           // Diagonal check
      .flatMap(r -> queens(minus(todo, r), new IntList(r, tail)));
}
public static boolean safe(int mid, IntList tail) {
  return safe(mid+1, mid-1, tail);
}
public static boolean safe(int d1, int d2, IntList tail) {
  return tail==null || d1!=tail.item && d2!=tail.item && safe(d1+1, d2-1, tail.next);
}
Callout on the slide: .parallel()
Example174.java
System.out.println(queens(8).findAny());
– Print the first 20 solutions
queens(8).limit(20).forEach(System.out::println);
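To run the queens code, the pieces that the slides leave out have to be supplied. The sketch below assumes a minimal IntList, a helper minus that returns a copy of a BitSet without one element, and a wrapper queens(n) that starts with all n rows still to be placed; these three, and the class name QueensSketch, are assumptions rather than the contents of Example174.java or Example176.java:

```java
import java.util.BitSet;
import java.util.stream.Stream;

class QueensSketch {
  // Assumed minimal immutable list of ints.
  static class IntList {
    final int item;
    final IntList next;
    IntList(int item, IntList next) { this.item = item; this.next = next; }
  }

  // Assumed helper: a copy of set without r; set itself is not changed.
  static BitSet minus(BitSet set, int r) {
    BitSet result = (BitSet) set.clone();
    result.clear(r);
    return result;
  }

  // As on the slide: extend the partial solution tail by every safe row in todo.
  static Stream<IntList> queens(BitSet todo, IntList tail) {
    if (todo.isEmpty())
      return Stream.of(tail);
    else
      return todo.stream()
        .filter(r -> safe(r, tail)).boxed()
        .flatMap(r -> queens(minus(todo, r), new IntList(r, tail)));
  }

  static boolean safe(int mid, IntList tail) {
    return safe(mid+1, mid-1, tail);
  }

  // True if no queen already in tail is on either diagonal of the new queen.
  static boolean safe(int d1, int d2, IntList tail) {
    return tail==null
      || d1!=tail.item && d2!=tail.item && safe(d1+1, d2-1, tail.next);
  }

  // Assumed wrapper: rows 0..n-1 are all still available.
  static Stream<IntList> queens(int n) {
    BitSet todo = new BitSet(n);
    todo.set(0, n);
    return queens(todo, null);
  }

  public static void main(String[] args) {
    System.out.println(queens(8).count());  // 92
    System.out.println(queens(8).findAny().isPresent());
  }
}
```

Because the stream is lazy, findAny and limit(20) can stop the search as soon as enough solutions have been produced, whereas count has to enumerate all of them.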
Example183.java
return IntStream.range(1, 31).asDoubleStream()
.flatMap(b -> bitReversedRange((int)b));
}
Example182.java
public String toString() {
return stream(this).mapToObj(String::valueOf)
.collect(Collectors.joining(",", "[", "]"));
}
Parallel (functional) array operations
• Simulating random motion on a line
– Take n random steps, each drawn from the interval [-1, +1):
Example25.java
double[] a = new Random().doubles(n, -1.0, +1.0)
  .toArray();
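The rest of the example is not included here. A plausible continuation, sketched under the assumption that the positions along the line are the running sums of the steps, uses Arrays.parallelPrefix; the class name RandomWalkSketch and the method positions are illustrative:

```java
import java.util.Arrays;
import java.util.Random;

class RandomWalkSketch {
  // Returns the positions after each step: result[i] = steps[0] + ... + steps[i].
  // The input array is copied, so steps itself is not modified.
  static double[] positions(double[] steps) {
    double[] a = steps.clone();
    Arrays.parallelPrefix(a, (x, y) -> x + y);
    return a;
  }

  public static void main(String[] args) {
    int n = 1_000_000;
    double[] steps = new Random().doubles(n, -1.0, +1.0).toArray();
    double[] pos = positions(steps);
    // Largest distance from the starting point reached during the walk.
    double maxDist = Arrays.stream(pos).parallel()
      .map(Math::abs).max().getAsDouble();
    System.out.println(maxDist);
  }
}
```

parallelPrefix requires an associative operation; floating-point addition is only approximately associative, so the parallel result may differ from a sequential running sum in the last digits.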
Some problems with Java streams
• Streams are use-once & have other restrictions
– Probably to permit easy parallelization
• Hard to create lazy finite streams
– Probably to allow high-performance implementation
• Difficult to control resource consumption
• A single side-effect may mess all up completely
• Sometimes .parallel() hurts performance a lot
– See exercise
– And strange behavior, in parallel + limit in Sudoku generator
• Laziness in Java is subtle, easily goes wrong:
Example216.java
static Stream<String> getPageAsStream(String url) throws IOException {
  try (BufferedReader in
         = new BufferedReader(new InputStreamReader(
             new URL(url).openStream()))) {
    return in.lines();
  }
}
Useless: closes the reader too early, so any use of the Stream<String> causes
IOException: Stream closed
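One possible way around the problem, sketched here and not taken from the slides, is to leave the reader open and let the stream close it through onClose; the caller then puts the stream itself in a try-with-resources statement. To keep the sketch runnable without network access it reads from an arbitrary Reader instead of a URL; the names LazyLinesSketch, linesOf and countLines are illustrative:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.stream.Stream;

class LazyLinesSketch {
  // The reader stays open until the returned stream is closed.
  static Stream<String> linesOf(Reader source) {
    BufferedReader in = new BufferedReader(source);
    return in.lines().onClose(() -> {
      try {
        in.close();
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    });
  }

  // The caller owns the stream and closes it after the terminal operation.
  static long countLines(Reader source) {
    try (Stream<String> lines = linesOf(source)) {
      return lines.count();
    }
  }

  public static void main(String[] args) {
    System.out.println(countLines(new StringReader("a\nb\nc")));  // 3
  }
}
```

The design moves the responsibility for closing from the method that creates the stream to the code that consumes it, which is the only place that knows when the lazy reading has finished.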
A multicore performance mystery
• K-means clustering 2P: Assign – Update –
Assign – Update … till convergence
TestKMeansSolution.java (pseudocode, imperative)
while (!converged) {
  let taskCount parallel tasks do {                 // Assign
    final int from = ..., to = ...;
    for (int pi=from; pi<to; pi++)
      myCluster[pi] = closest(points[pi], clusters);
  }
  let taskCount parallel tasks do {                 // Update
    final int from = ..., to = ...;
    for (int pi=from; pi<to; pi++)
      myCluster[pi].addToMean(points[pi]);
  }
  ...
}
• Assign: writes a point to myCluster[pi]
• Update: calls addToMean on myCluster[pi]
A multicore performance mystery
• "Improved" version 2Q:
– call addToMean directly on point
– instead of first writing it to myCluster array
while (!converged) {
  let taskCount parallel tasks do {
    final int from = ..., to = ...;
    for (int pi=from; pi<to; pi++)
      closest(points[pi], clusters).addToMean(points[pi]);
  }
  ...
}
Performance of k-means clustering
• Sequential: as you would expect, 5% speedup
• Parallel: surprisingly bad!
                     2P      2Q      2Q/2P
Sequential          4.240   4.019    0.95
4-core parallel     1.310   2.234    1.70    Bad
24-core parallel    0.852   6.587    7.70    Very bad
Time in seconds for 200,000 points, 81 clusters, 1/8/48 tasks, 108 iterations
The Point and Cluster classes
class Point {
public final double x, y;
}
Cluster object layout (maybe): mean, sumx, sumy, count
KMeans 2P
• Assignment step
– Reads each Cluster’s mean field 200,000 times
– Writes only myCluster array segments, separately
– Takes no locks at all
• Update step
– Calls addToMean 200,000 times
– Writes the 81 clusters’ sumx, sumy, count fields
200,000 times in total
– Takes Cluster object locks 200,000 times
KMeans 2Q
• Unified loop
– Reads each Cluster’s mean field 200,000 times
– Calls addToMean 200,000 times and writes the
sumx, sumy, count fields 200,000 times in total
– Takes Cluster object locks 200,000 times
• Problem in 2Q:
– mean reads are mixed with sumx, sumy, ... writes
– The writes invalidate the cached mean field
– The 200,000 mean field reads become slower
– False sharing: mean and sumx on same cache line
– (A problem on Intel i7, not on 20 x slower ARM A7)
• See http://www.itu.dk/people/sestoft/papers/cpucache-20170319.pdf
Parallel streams to the rescue, 3P
while (!converged) {                                // Functional
  final Cluster[] clustersLocal = clusters;
  Map<Cluster, List<Point>> groups =                // Assign
    Arrays.stream(points).parallel()
      .collect(Collectors.groupingBy(p -> closest(p,clustersLocal)));
  clusters = groups.entrySet().stream().parallel()
    .map(kv -> new Cluster(kv.getKey().getMean(), kv.getValue()))
    .toArray(Cluster[]::new);
  Cluster[] newClusters =                           // Update
    Arrays.stream(clusters).parallel()
      .map(Cluster::computeMean).toArray(Cluster[]::new);
  converged = Arrays.equals(clusters, newClusters);
  clusters = newClusters;
}
                          2P      2Q      3P
Sequential               4.240   4.019   5.353
4-core parallel i7       1.310   2.234   1.350
24-core parallel Xeon    0.852   6.587   0.553
Time in seconds for 200,000 points, 81 clusters, 1/8/48 tasks, 108 iterations
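The Point and Cluster classes used by 3P are not shown in full, so the following sketch illustrates the same Assign and Update idea on a deliberately simplified problem: points and cluster centers are plain doubles on a line. The class name AssignUpdateSketch, the methods closest and step, and the choice to keep the old center for a cluster that receives no points are all assumptions made for this illustration:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

class AssignUpdateSketch {
  // Index of the center closest to p.
  static int closest(double p, double[] centers) {
    int best = 0;
    for (int c = 1; c < centers.length; c++)
      if (Math.abs(p - centers[c]) < Math.abs(p - centers[best]))
        best = c;
    return best;
  }

  // One round: group the points by closest center (Assign),
  // then compute the mean of each group (Update).
  static double[] step(double[] points, double[] centers) {
    Map<Integer, Double> means = Arrays.stream(points).parallel().boxed()
      .collect(Collectors.groupingBy(p -> closest(p, centers),
                                     Collectors.averagingDouble(p -> p)));
    // Centers that received no points keep their old value (an assumption).
    double[] result = centers.clone();
    means.forEach((c, mean) -> result[c] = mean);
    return result;
  }

  public static void main(String[] args) {
    double[] points = { 1.0, 2.0, 9.0, 11.0 };
    double[] centers = { 0.0, 10.0 };
    System.out.println(Arrays.toString(step(points, centers)));  // [1.5, 10.0]
  }
}
```

No thread writes to shared cluster state during the round: each task builds its own partial groups and the collector merges them, which is what removes the lock traffic and cache invalidation seen in 2Q.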
Exercise: Streams & floating-point sum
TestStreamSums.java
• Compute series sum 1/1 + 1/2 + 1/3 + ... for N=999,999,999
• For-loop, forwards summation
double sum = 0.0;
for (int i=1; i<N; i++)
  sum += 1.0/i;
• For-loop, backwards summation
double sum = 0.0;
for (int i=1; i<N; i++)
  sum += 1.0/(N-i);
Different results?
• Exercises
– Extend immutable list class with functional
programming; use parallel array operations; use
streams of words and streams of numbers
– Alternatively: Make a faster and more scalable k-means
clustering implementation, if possible, in
any language