Flink Forward San Francisco 2022.
Probably everyone who has written stateful Apache Flink applications has used one of the fault-tolerant keyed state primitives ValueState, ListState, and MapState. With RocksDB, however, retrieving and updating items comes at an increased cost that you should be aware of. Sometimes, these costs cannot be avoided with the current API, e.g., for efficient event-time stream-sorting or streaming joins where you need to iterate one or two buffered streams in the right order. With FLIP-220, we are introducing a new state primitive: BinarySortedMultiMapState. This new form of state allows you to (a) efficiently store lists of values for a user-provided key, and (b) iterate keyed state in a well-defined sort order. Both features can be backed efficiently by RocksDB with a 2x performance improvement over the current workarounds. This talk will go into the details of the new API and its implementation, present how to use it in your application, and talk about the process of getting it into Flink.
by
Nico Kruber
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Flink Forward
Flink Forward San Francisco 2022.
Flink consumers read from Kafka as a scalable, high throughput, and low latency data source. However, there are challenges in scaling out data streams where migration and multiple Kafka clusters are required. Thus, we introduced a new Kafka source to read sharded data across multiple Kafka clusters in a way that conforms well with elastic, dynamic, and reliable infrastructure. In this presentation, we will present the source design and how the solution increases application availability while reducing maintenance toil. Furthermore, we will describe how we extended the existing KafkaSource to provide mechanisms to read logical streams located on multiple clusters, to dynamically adapt to infrastructure changes, and to perform transparent cluster migrations and failover.
by
Mason Chen
Dynamic Rule-based Real-time Market Data AlertsFlink Forward
Flink Forward San Francisco 2022.
At Bloomberg, we deal with high volumes of real-time market data. Our clients expect to be notified of any anomalies in this market data, which may indicate volatile movements in the markets, notable trades, forthcoming events, or system failures. The parameters for these alerts are always evolving and our clients can update them dynamically. In this talk, we'll cover how we utilized the open source Apache Flink and Siddhi SQL projects to build a distributed, scalable, low-latency and dynamic rule-based, real-time alerting system to solve our clients' needs. We'll also cover the lessons we learned along our journey.
by
Ajay Vyasapeetam & Madhuri Jain
Flink Forward San Francisco 2022.
This talk will take you on the long journey of Apache Flink into the cloud-native era. It starts all the way back when Hadoop and YARN were the standard way of deploying and operating data applications.
We're going to deep dive into the cloud-native set of principles and how they map to the Apache Flink internals and recent improvements. We'll cover fast checkpointing, fault tolerance, resource elasticity, minimal infrastructure dependencies, industry-standard tooling, ease of deployment and declarative APIs.
After this talk you'll get a broader understanding of the operational requirements for a modern streaming application and where the current limits are.
by
David Moravek
Tuning Apache Kafka Connectors for Flink.pptxFlink Forward
Flink Forward San Francisco 2022.
In normal situations, the default Kafka consumer and producer configuration options work well. But we all know life is not all roses and rainbows, and in this session we'll explore a few knobs that can save the day in atypical scenarios. First, we'll take a detailed look at the parameters available when reading from Kafka. We'll inspect the params that help us quickly spot an application lock or crash, the ones that can significantly improve performance, and the ones to touch with gloves since they could cause more harm than benefit. Moreover, we'll explore the partitioning options and discuss when diverging from the default strategy is needed. Next, we'll discuss the Kafka sink. After browsing the available options, we'll then dive deep into understanding how to approach use cases like sinking enormous records, managing spikes, and handling small but frequent updates. If you want to understand how to make your application survive when the sky is dark, this session is for you!
by
Olena Babenko
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeDatabricks
Change Data Capture (CDC) is a typical use case in Real-Time Data Warehousing. It tracks the data change log (binlog) of a relational database (OLTP) and replays these change logs in a timely manner to external storage, such as Delta or Kudu, to enable Real-Time OLAP. To implement a robust CDC streaming pipeline, many factors must be considered, such as how to ensure data accuracy, how to handle schema changes in the OLTP source, and whether it is easy to support a variety of databases with little code.
Stephan Ewen - Experiences running Flink at Very Large ScaleVerverica
This talk shares experiences from deploying and tuning Flink stream processing applications at very large scale. We share lessons learned from users, contributors, and our own experiments about running demanding streaming jobs at scale. The talk will explain what aspects currently render a job as particularly demanding, show how to configure and tune a large-scale Flink job, and outline what the Flink community is working on to make the out-of-the-box experience as smooth as possible. We will, for example, dive into analyzing and tuning checkpointing, selecting and configuring state backends, understanding common bottlenecks, and understanding and configuring network parameters.
Introducing the Apache Flink Kubernetes OperatorFlink Forward
Flink Forward San Francisco 2022.
The Apache Flink Kubernetes Operator provides a consistent approach to manage Flink applications automatically, without any human interaction, by extending the Kubernetes API. Given the increasing adoption of Kubernetes based Flink deployments the community has been working on a Kubernetes native solution as part of Flink that can benefit from the rich experience of community members and ultimately make Flink easier to adopt. In this talk we give a technical introduction to the Flink Kubernetes Operator and demonstrate the core features and use-cases through in-depth examples.
by
Thomas Weise
Ever tried to get clarity on what kinds of memory there are and how to tune each of them? If not, very likely your jobs are configured incorrectly. As we found out, it is not straightforward, and it is not well documented either. This session will provide information on the types of memory to be aware of, the calculations involved in determining how much is allocated to each type of memory, and how to tune it depending on the use case.
Exactly-Once Financial Data Processing at Scale with Flink and PinotFlink Forward
Flink Forward San Francisco 2022.
At Stripe we have created a complete end to end exactly-once processing pipeline to process financial data at scale, by combining the exactly-once power from Flink, Kafka, and Pinot together. The pipeline provides exactly-once guarantee, end-to-end latency within a minute, deduplication against hundreds of billions of keys, and sub-second query latency against the whole dataset with trillion level rows. In this session we will discuss the technical challenges of designing, optimizing, and operating the whole pipeline, including Flink, Kafka, and Pinot. We will also share our lessons learned and the benefits gained from exactly-once processing.
by
Xiang Zhang & Pratyush Sharma & Xiaoman Dong
Kafka's basic terminologies, its architecture, its protocol and how it works.
Kafka at scale, its caveats, guarantees and use cases offered by it.
How we use it @ZaprMediaLabs.
The top 3 challenges running multi-tenant Flink at scaleFlink Forward
Apache Flink is the foundation for Decodable's real-time SaaS data platform. Flink runs critical data processing jobs with strong security requirements. In addition, Decodable has to scale to thousands of tenants, power various use cases, provide an intuitive user experience and maintain cost-efficiency. We've learned a lot of lessons while building and maintaining the platform. In this talk, I'll share the top 3 toughest challenges building and operating this platform with Flink, and how we solved them.
Presto on Apache Spark: A Tale of Two Computation EnginesDatabricks
The architectural tradeoffs between the map/reduce paradigm and parallel databases has been a long and open discussion since the dawn of MapReduce over more than a decade ago. At Facebook, we have spent the past several years in independently building and scaling both Presto and Spark to Facebook scale batch workloads, and it is now increasingly evident that there is significant value in coupling Presto’s state-of-art low-latency evaluation with Spark’s robust and fault tolerant execution engine.
Flink powered stream processing platform at PinterestFlink Forward
Flink Forward San Francisco 2022.
Pinterest is a visual discovery engine that serves over 433MM users. Stream processing allows us to unlock value from realtime data for pinners. At Pinterest, we have adopted Flink as the unified stream processing engine. In this talk, we will share our journey in building a stream processing platform with Flink and how we onboarded critical use cases to the platform. Pinterest has supported 90+ near-realtime streaming applications. We will cover the problem statement, how we evaluated potential solutions, and our decision to build the framework.
by
Rainie Li & Kanchi Masalia
With the rise of the Internet of Things (IoT) and low-latency analytics, streaming data becomes ever more important. Surprisingly, one of the most promising approaches for processing streaming data is SQL. In this presentation, Julian Hyde shows how to build streaming SQL analytics that deliver results with low latency, adapt to network changes, and play nicely with BI tools and stored data. He also describes how Apache Calcite optimizes streaming queries, and the ongoing collaborations between Calcite and the Storm, Flink and Samza projects.
This talk was given by Julian Hyde at the Apache Big Data conference in Vancouver on 2016/05/09.
Distributed stream processing is evolving from a technology in the sidelines of Big Data to a key enabler for businesses to provide more scalable, real-time services to their customers. We at Ververica, the company founded by the original creators of Apache Flink, and other prominent players in the Flink community have witnessed this development from the driver's seat. Working with our customers and the wider community, we have seen great success stories and we have seen things going wrong. In this talk, I would like to share anecdotes and hard-learned lessons of adopting distributed stream processing, Apache Flink-specific as well as across frameworks. Afterwards, you will know how not to model your use cases as a stream processing application, which data structures not to use, how not to deal with failure, how not to approach the topic of monitoring, and much more.
Video: https://www.youtube.com/watch?v=F7HQd3KX2TQ&list=PLDX4T_cnKjD207Aa8b5CsZjc7Z_KRezGz&index=48&t=6s
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
This document provides an overview and summary of Amazon S3 best practices and tuning for Hadoop/Spark in the cloud. It discusses the relationship between Hadoop/Spark and S3, the differences between HDFS and S3 and their use cases, details on how S3 behaves from the perspective of Hadoop/Spark, well-known pitfalls and tunings related to S3 consistency and multipart uploads, and recent community activities related to S3. The presentation aims to help users optimize their use of S3 storage with Hadoop/Spark frameworks.
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...Flink Forward
Netflix’s playback data records every user interaction with video on the service, from trailers on the home page to full-length movies. This is a critical dataset with high volume that is used broadly across Netflix, powering product experiences, AB test metrics, and offline insights. In processing playback data, we depend heavily on event-time partitioning to handle a long tail of late arriving events. In this talk, I’ll provide an overview of our recent implementation of generic event-time partitioning on high volume streams using Apache Flink and Apache Iceberg (Incubating). Built as configurable Flink components that leverage Iceberg as a new output table format, we are now able to write playback data and other large scale datasets directly from a stream into a table partitioned on event time, replacing the common pattern of relying on a post-processing batch job that “puts the data in the right place”. We’ll talk through what it took to apply this to our playback data in practice, as well as challenges we hit along the way and tradeoffs with a streaming approach to event-time partitioning.
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
Recently, a set of modern table formats such as Delta Lake, Hudi, Iceberg spring out. Along with Hive Metastore these table formats are trying to solve problems that stand in traditional data lake for a long time with their declared features like ACID, schema evolution, upsert, time travel, incremental consumption etc.
Using the New Apache Flink Kubernetes Operator in a Production DeploymentFlink Forward
Flink Forward San Francisco 2022.
Running natively on Kubernetes, using the new Apache Flink Kubernetes Operator is a great way to deploy and manage Flink application and session deployments. In this presentation, we provide:
- A brief overview of Kubernetes operators and their benefits
- An introduction to the five levels of the operator maturity model
- An introduction to the newly released Apache Flink Kubernetes Operator and FlinkDeployment CRs
- Dockerfile modifications you can make to swap out UBI images and Java of the underlying Flink Operator container
- Enhancements we're making in versioning/upgradeability/stability and security
- A demo of the Apache Flink Operator in action, with a technical preview of an upcoming product using the Flink Kubernetes Operator
- Lessons learned
- Q&A
by
James Busche & Ted Chang
Redis is an in-memory key-value store that is often used as a database, cache, and message broker. It supports various data structures like strings, hashes, lists, sets, and sorted sets. While data is stored in memory for fast access, Redis can also persist data to disk. It is widely used by companies like GitHub, Craigslist, and Engine Yard to power applications with high performance needs.
This document provides an overview of Weather.com's analytics architecture using Apache Cassandra and Spark. It summarizes Weather.com's initial attempts using Cassandra, lessons learned, and its improved architecture. The improved architecture uses Cassandra for streaming event data with time-window compaction, stores all other data in Amazon S3 for batch processing in Spark, and replaces Kafka with Amazon SQS for event ingestion. It discusses best practices for data modeling in Cassandra including partitioning, secondary indexes, and avoiding wide rows and nulls. The document also highlights how Weather.com uses Apache Zeppelin notebooks for data exploration and visualization.
- Spark Streaming allows processing of live data streams using Spark's batch processing engine by dividing streams into micro-batches.
- A Spark Streaming application consists of input streams, transformations on those streams such as maps and filters, and output operations. The application runs continuously processing each micro-batch.
- Key aspects of operationalizing Spark Streaming jobs include checkpointing to ensure fault tolerance, optimizing throughput by increasing parallelism, and debugging using Spark UI.
Speakers: Chris Larsen (Limelight Networks) and Benoit Sigoure (Arista Networks)
The OpenTSDB community continues to grow, with users looking to store massive amounts of time-series data in a scalable manner. In this talk, we will discuss a number of use cases and best practices around naming schemas and HBase configuration. We will also review OpenTSDB 2.0's new features, including the HTTP API, plugins, annotations, millisecond support, and metadata, as well as what's next in the roadmap.
This document summarizes a presentation about Norikra, an open source SQL stream processing engine written in Ruby. Some key points:
- Norikra allows for schema-less stream processing using SQL queries without needing to restart for new queries. It supports windows, joins, UDFs.
- Example queries demonstrate counting events by field values over time windows. Nested fields can be queried directly.
- Norikra is used in production at LINE for analytics like API error reporting and a Lambda architecture with real-time and batch processing.
- It is built on JRuby to leverage Java libraries and Ruby gems, and can handle 10k-100k events/sec on typical hardware.
Application Monitoring using Open Source: VictoriaMetrics - ClickHouseVictoriaMetrics
Monitoring is the key to successful operation of any software service, but commercial solutions are complex, expensive, and slow. Let us show you how to build monitoring that is simple, cost-effective, and fast using open source stacks easily accessible to any developer.
We’ll start with the elements of monitoring systems: data ingest, query engine, visualization, and alerting. We’ll then explain and contrast two implementation approaches. The first uses VictoriaMetrics, a fast growing, high performance time series database that uses PromQL for queries. The second is based on ClickHouse, a popular real-time analytics database that speaks SQL. Fast, affordable monitoring is within reach. This webinar provides designs and working code to get you there.
Application Monitoring using Open Source - VictoriaMetrics & Altinity ClickHo...Altinity Ltd
Application Monitoring using Open Source - VictoriaMetrics & Altinity ClickHouse Webinar Slides
Monitoring is the key to the successful operation of any software service, but commercial solutions are complex, expensive, and slow. Let us show you how to build monitoring that is simple, cost-effective, and fast using open-source stacks easily accessible to any developer.
We’ll start with the elements of monitoring systems: data ingest, query engine, visualization, and alerting. We’ll then explain and contrast two implementation approaches. The first uses VictoriaMetrics, a fast-growing, high-performance time series database that uses PromQL for queries. The second is based on ClickHouse, a popular real-time analytics database that speaks SQL. Fast, affordable monitoring is within reach. This webinar provides designs and working code to get you there.
Presented by:
Roman Khavronenko, Co-Founder at VictoriaMetrics
Robert Hodges, CEO at Altinity
BigQuery case study in Groovenauts & Dive into the DataflowJavaSDKnagachika t
This document summarizes a presentation about using BigQuery and the Dataflow Java SDK. It discusses how Groovenauts uses BigQuery to analyze data from their MAGELLAN container hosting service, including resource monitoring, developer activity logs, application logs, and end-user access logs. It then provides an overview of the Dataflow Java SDK, including the key concepts of PCollections, coders, PTransforms, composite transforms, ParDo and DoFn, and windowing.
WSO2Con USA 2015: WSO2 Analytics Platform - The One Stop Shop for All Your Da...WSO2
The WSO2 Analytics Platform uniquely combines real-time and batch analytics to derive insights from IoT, mobile, and web application data. It includes the WSO2 Data Analytics Server for collecting, analyzing, and communicating real-time and persisted data. The platform can collect data from various sources, analyze it using Spark SQL for batch or Siddhi for real-time, and communicate results through alerts, dashboards, and APIs. It also features predictive analytics capabilities via machine learning algorithms.
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17spark-project
Slides from Tathagata Das's talk at the Spark Meetup entitled "Deep Dive with Spark Streaming" on June 17, 2013 in Sunnyvale California at Plug and Play. Tathagata Das is the lead developer on Spark Streaming and a PhD student in computer science in the UC Berkeley AMPLab.
Spark Streaming allows processing of live data streams using Spark. It works by receiving data streams, chopping them into batches, and processing the batches using Spark. This presentation covered Spark Streaming concepts like the lifecycle of a streaming application, best practices for aggregations, operationalization through checkpointing, and achieving high throughput. It also discussed debugging streaming jobs and the benefits of combining streaming with batch, machine learning, and SQL processing.
The document summarizes Spring Data Neo4j 4.0, a new version of the Spring Data project that provides integration with the Neo4j graph database. It describes Neo4j and Spring Data briefly, then outlines the key features and architecture of SDN 4.0, including a standalone object-graph mapping layer, variable depth persistence, and integration with Spring and repositories. It demonstrates a sample conference application built with SDN 4.0 and provides information on getting started and support resources.
SparkSQL: A Compiler from Queries to RDDsDatabricks
SparkSQL, a module for processing structured data in Spark, is one of the fastest SQL on Hadoop systems in the world. This talk will dive into the technical details of SparkSQL spanning the entire lifecycle of a query execution. The audience will walk away with a deeper understanding of how Spark analyzes, optimizes, plans and executes a user’s query.
Speaker: Sameer Agarwal
This talk was originally presented at Spark Summit East 2017.
Java/Scala Lab: Anatolii Kmetiuk - Scala SubScript: An Algebra for Reactive Pro...GeeksLab Odessa
SubScript is an extension of the Scala language that adds support for the constructs and syntax of the Algebra of Communicating Processes (ACP). SubScript is a promising extension, applicable both to the development of highly loaded parallel systems and to simple personal applications.
Introducing TiDB [Delivered: 09/27/18 at NYC SQL Meetup]Kevin Xu
This presentation was delivered at the NYC SQL meetup on September 27, 2018. It provided a technical overview of the TiDB Platform, a deep dive into TiDB's MySQL compatible layer and MySQL ecosystem tools, use case of Mobike, and appendix with detail materials on coprocessor and transaction model.
WSO2Con ASIA 2016: WSO2 Analytics Platform: The One Stop Shop for All Your Da...WSO2
Today’s highly connected world is flooding businesses with big and fast-moving data. The ability to trawl this data ocean and identify actionable insights can deliver a competitive advantage to any organization. The WSO2 Analytics Platform enables businesses to do just that by providing batch, real-time, interactive and predictive analysis capabilities all in one place.
In this tutorial we will
Plug in the WSO2 Analytics Platform to some common business use cases
Showcase the numerous capabilities of the platform
Demonstrate how to collect data, analyze, predict and communicate effectively
WSO2 Analytics Platform uniquely combines real-time and batch analytics to derive insights from IoT, mobile, and web application data. It includes WSO2 Data Analytics Server for collecting, analyzing, and communicating real-time and stored data. The platform can collect data, perform batch analytics using Spark SQL, real-time analytics using Siddhi, and predictive analytics using machine learning models. It also supports dashboards, APIs, alerts, and other methods for communicating results.
BDA403 How Netflix Monitors Applications in Real-time with Amazon KinesisAmazon Web Services
Thousands of services work in concert to deliver millions of hours of video streams to Netflix customers every day. These applications vary in size, function, and technology, but they all make use of the Netflix network to communicate. Understanding the interactions between these services is a daunting challenge both because of the sheer volume of traffic and the dynamic nature of deployments. In this talk, we’ll first discuss why Netflix chose Amazon Kinesis Streams over other streaming data solutions like Kafka to address these challenges at scale. We’ll then dive deep into how Netflix uses Amazon Kinesis Streams to enrich network traffic logs and identify usage patterns in real time. Lastly, we will cover how Netflix uses this system to build comprehensive dependency maps, increase network efficiency, and improve failure resiliency. From this talk, you’ll take away techniques and processes that you can apply to your large-scale networks and derive real-time, actionable insights.
Real-Time ETL in Practice with WSO2 Enterprise IntegratorWSO2
The availability of timely information and data is critical for modern enterprises. Delays of even minutes are not acceptable in many cases, and data needs to be available in realtime. However, legacy systems that can't generate data streams that can be consumed in real-time still exist. These legacy systems emit their output as static data stores such as files or DB tables. Integrating these static data sources in realtime is crucial, and this is where real-time ETL comes to the rescue.
This deck explores how WSO2 Streaming Integrator can be used for real-time ETL with techniques such as change data capture and file streaming.
Watch the webinar on-demand here - https://wso2.com/library/webinars/2020/03/real-time-etl-in-practice-with-wso2-enterprise-integrator/
Founding committer of Spark, Patrick Wendell, gave this talk at 2015 Strata London about Apache Spark.
These slides provide an introduction to Spark and delve into future developments, including DataFrames, the Datasource API, the Catalyst logical optimizer, and Project Tungsten.
Similar to Introducing BinarySortedMultiMap - A new Flink state primitive to boost your application performance (20)
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...Flink Forward
Flink Forward San Francisco 2022.
To improve Amazon Alexa experiences and support machine learning inference at scale, we built an automated end-to-end solution for incremental model building or fine-tuning machine learning models through continuous learning, continual learning, and/or semi-supervised active learning. Customer privacy is our top concern at Alexa, and as we build solutions, we face unique challenges when operating at scale such as supporting multiple applications with tens of thousands of transactions per second with several dependencies including near-real time inference endpoints at low latencies. Apache Flink helps us transform and discover metrics in near-real time in our solution. In this talk, we will cover the challenges that we faced, how we scale the infrastructure to meet the needs of ML teams across Alexa, and go into how we enable specific use cases that use Apache Flink on Amazon Kinesis Data Analytics to improve Alexa experiences to delight our customers while preserving their privacy.
by
Aansh Shah
One sink to rule them all: Introducing the new Async SinkFlink Forward
Flink Forward San Francisco 2022.
Next time you want to integrate with a new destination for a demo, concept or production application, the Async Sink framework will bootstrap development, allowing you to move quickly without compromise. In Flink 1.15 we introduced the Async Sink base (FLIP-171), with the goal to encapsulate common logic and allow developers to focus on the key integration code. The new framework handles things like request batching, buffering records, applying backpressure, retry strategies, and at least once semantics. It allows you to focus on your business logic, rather than spending time integrating with your downstream consumers. During the session we will dive deep into the internals to uncover how it works, why it was designed this way, and how to use it. We will code up a new sink from scratch and demonstrate how to quickly push data to a destination. At the end of this talk you will be ready to start implementing your own Flink sink using the new Async Sink framework.
by
Steffen Hausmann & Danny Cranmer
Flink Forward San Francisco 2022.
Based on the new Flink-Pulsar connector, we implemented Flink's Table API and Catalog to help users interact with the Pulsar cluster via Flink SQL easily. We would like to go through the design and implementation of the SQL connector in the following aspects:
1. Two different modes of using Pulsar as a metadata store
2. Data format transformation and management
3. SQL semantics support within Pulsar context
by
Sijie Guo & Neng Lu
Processing Semantically-Ordered Streams in Financial ServicesFlink Forward
Flink Forward San Francisco 2022.
What if my data is already in order? Stream Processing has given us an elegant and powerful solution for running analytic queries and logic over high volumes of continuously arriving data. However, in both Apache Flink and Apache Beam, the notion of time-ordering is baked in at a very low level, making it difficult to express computations that are interested in a semantic-, rather than time-ordering of the data. In financial services, what often matters the most about the data moving between systems is not when the data was created, but in what order, to the extent that many institutions engineer a global sequencing over all data entering and produced by their systems to achieve complete determinism. How, then, can financial institutions and others best employ Stream Processing on streams of data that are already ordered? I will cover various techniques that can make this work, as well as seek input from the community on how Flink might be improved to better support these use-cases.
by
Patrick Lucas
Flink Forward San Francisco 2022.
At Flink Forward, we get to hear creative, unique use cases, often on the bleeding edge of some of the most exciting current technologies. This talk will give you a chance to open up the hood on our driven and innovative open source community. I will cover what our community has been working on this past year, and how this work relates to our (Ververica's) exciting new Flink engineering roadmap! I will also go through some best practices and upcoming opportunities for getting involved in this community!
by
Caito Scherr
Practical learnings from running thousands of Flink jobsFlink Forward
Flink Forward San Francisco 2022.
Task Managers constantly running out of memory? Flink job keeps restarting from cryptic Akka exceptions? Flink job running but doesn't seem to be processing any records? We share practical learnings from running thousands of Flink jobs for different use-cases and take a look at common challenges we have experienced, such as out-of-memory errors, timeouts, and job stability. We will cover memory tuning, S3 and Akka configurations to address common pitfalls, and the approaches that we take on automating health monitoring and management of Flink jobs at scale.
by
Hong Teoh & Usamah Jassat
Extending Flink SQL for stream processing use casesFlink Forward
1. For streaming data, Flink SQL uses STREAMs for append-only queries and CHANGELOGs for upsert queries instead of tables.
2. Stateless queries on streaming data, such as projections and filters, result in new STREAMs or CHANGELOGs.
3. Stateful queries, such as aggregations, produce STREAMs or CHANGELOGs depending on whether they are windowed or not. Join queries between streaming sources also result in STREAM outputs.
Using Queryable State for Fun and ProfitFlink Forward
Flink Forward San Francisco 2022.
A particular feature in our system relies on a streaming 90-minute trailing window of 1-minute samples - implemented as a lookaside cache - to speed up a particular query, allowing our customers to rapidly see an overview of their estate. Across our entire customer base, there is a substantial amount of data flowing into this cache - ~1,000,000 entries/second, with the entire cache requiring ~600GB of RAM. The current implementation is simplistic but expensive. In this talk I describe a replacement implementation as a stateful streaming Flink application leveraging Queryable State. This Flink application reduces the net cost by ~90%. In this session, the implementation is described in detail, including windowing considerations, a sliding-window state buffer that avoids the sliding window replication penalty, and a comparison of queryable state and Redis queries. The talk concludes with a frank discussion of when this distinctive approach is, and is not, appropriate.
by
Ron Crocker
Changelog Stream Processing with Apache FlinkFlink Forward
Flink Forward San Francisco 2022.
The world is constantly changing. Data is continuously produced and thus should be consumed in a similar fashion by enterprise systems. Only this enables real-time decisions at scale. Message logs such as Apache Kafka can be found in almost every architecture, while databases and other batch systems still provide the foundation. Change Data Capture (CDC) propagates changes downstream. In this talk, we will highlight what it means to be a general data processor and how Flink can act as an integration hub. We present the current state of Flink and how it can power various use cases on both finite and infinite streams. We demonstrate Flink's SQL engine as a changelog processor that is shipped with an ecosystem tailored to process CDC data and maintain materialized views. We will use Kafka as an upsert log, Debezium for connecting to databases, and enrich streams of various sources. Finally, we will combine Flink's Table API with DataStream API for event-driven applications beyond SQL.
by
Timo Walther
Large Scale Real Time Fraudulent Web Behavior DetectionFlink Forward
Flink Forward San Francisco 2022.
Neuro-ID analyzes web behavior at a large scale to determine visitors' intent on web pages, specifically in the online lending industry. When users interact with an online loan application, our software analyzes their behavior to determine if the applicant may be potentially fraudulent. Lenders can then request various scores describing the applicant's intentions in real-time to use to make decisions during the application flow. Flink gives our product the ability to observe behavior in a stateful manner. As an applicant interacts with an online loan application, a Flink application is used to compare earlier actions to later actions. This processing in Flink can determine the applicant's intent throughout the process of the application.
by
Jeff Niemann & Randy Hanak
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Flink Forward
Flink Forward San Francisco 2022.
Being in the payments space, Stripe requires strict correctness and freshness guarantees. We rely on Flink as the natural solution for delivering on this in support of our Change Data Capture (CDC) infrastructure. We heavily rely on CDC as a tool for capturing data change streams from our databases without critically impacting database reliability, scalability, and maintainability. Data derived from these streams is used broadly across the business and powers many of our critical financial reporting systems totalling over $640 Billion in payment volume annually. We use many components of Flink’s flexible DataStream API to perform aggregations and abstract away the complexities of stream processing from our downstreams. In this talk, we’ll walk through our experience from the very beginning to what we have in production today. We’ll share stories around the technical details and trade-offs we encountered along the way.
by
Jeff Chao
Building Reliable Lakehouses with Apache Flink and Delta LakeFlink Forward
Flink Forward San Francisco 2022.
Apache Flink and Delta Lake together allow you to build the foundation for your data lakehouses by ensuring the reliability of your concurrent streams from processing to the underlying cloud object-store. Together, the Flink/Delta Connector enables you to store data in Delta tables such that you harness Delta’s reliability by providing ACID transactions and scalability while maintaining Flink’s end-to-end exactly-once processing. This ensures that the data from Flink is written to Delta Tables in an idempotent manner such that even if the Flink pipeline is restarted from its checkpoint information, the pipeline will guarantee no data is lost or duplicated thus preserving the exactly-once semantics of Flink.
by
Scott Sandre & Denny Lee
Near real-time statistical modeling and anomaly detection using Flink!Flink Forward
Flink Forward San Francisco 2022.
At ThousandEyes we receive billions of events every day that allow us to monitor the internet; the most important aspect of our platform is to detect outages and anomalies that have the potential to cause serious impact to customer applications and user experience. Automatic detection of such events at the lowest latency and highest accuracy is extremely important for our customers and their business. After launching several resilient and low-latency data pipelines in production using Flink, we decided to take it up a notch; we leveraged Flink to build statistical models in near real-time and apply them to the incoming stream of events to detect anomalies! In this session we will deep dive into the design as well as discuss pitfalls and learnings while developing our real-time platform that leverages Debezium, Kafka, Flink, ElastiCache and DynamoDB to process events at scale!
by
Kunal Umrigar & Balint Kurnasz
Airports, banks, stock exchanges, and countless other critical operations got thrown into chaos!
In an unprecedented event, a recent CrowdStrike update caused a global IT meltdown, leading to widespread Blue Screen of Death (BSOD) errors and crippling 8.5 million Microsoft Windows systems.
What triggered this massive disruption? How did Microsoft step in to provide a lifeline? And what are the next steps for recovery?
Swipe to uncover the full story, including expert insights and recovery steps for those affected.
TrustArc Webinar - Innovating with TRUSTe Responsible AI CertificationTrustArc
In a landmark year marked by significant AI advancements, it’s vital to prioritize transparency, accountability, and respect for privacy rights with your AI innovation.
Learn how to navigate the shifting AI landscape with our innovative solution TRUSTe Responsible AI Certification, the first AI certification designed for data protection and privacy. Crafted by a team with 10,000+ privacy certifications issued, this framework integrated industry standards and laws for responsible AI governance.
This webinar will review:
- How compliance can play a role in the development and deployment of AI systems
- How to model trust and transparency across products and services
- How to save time and work smarter in understanding regulatory obligations, including AI
- How to operationalize and deploy AI governance best practices in your organization
Increase Quality with User Access Policies - July 2024Peter Caitens
⭐️ Increase Quality with User Access Policies ⭐️, presented by Peter Caitens and Adam Best of Salesforce. View the slides from this session to hear all about “User Access Policies” and how they can help you onboard users faster with greater quality.
Getting Ready for Copilot for Microsoft 365 with Governance Features in Share...Juan Carlos Gonzalez
Session delivered at the Microsoft 365 Chicago Community Days where I introduce how governance controls within SharePoint Premium are a key asset in a successful rollout of Copilot for Microsoft 365. The session was mostly a hands-on session with multiple demos, as you can see in the session recording available on YouTube: https://www.youtube.com/watch?v=MavcP6k5nU8&t=199s. For more information about governance controls available in SharePoint Premium, visit the official documentation available at Microsoft Learn: https://learn.microsoft.com/en-us/sharepoint/advanced-management
Webinar: Transforming Substation Automation with Open Source SolutionsDanBrown980551
This webinar will provide an overview of open source software and tooling for digital substation automation in energy systems. The speakers will provide a brief overview of how open source collaborative development works in general, then delve into how it is driving innovation and accelerating the pace of substation automation. Examples of specific open source solutions and real-world implementations by utilities will be discussed. Participants will walk away with a better understanding of the challenges of automating substations, the ecosystem of solutions available to help, and best practices for implementing them.
IVE 2024 Short Course Lecture 9 - Empathic Computing in VRMark Billinghurst
IVE 2024 Short Course Lecture 9 on Empathic Computing in VR.
This lecture was given by Kunal Gupta on July 17th 2024 at the University of South Australia.
Discover practical tips and tricks for streamlining your Marketo programs from end to end. Whether you're new to Marketo or looking to enhance your existing processes, our expert speakers will provide insights and strategies you can implement right away.
Leading Bigcommerce Development Services for Online RetailersSynapseIndia
As a leading provider of Bigcommerce development services, we specialize in creating powerful, user-friendly e-commerce solutions. Our services help online retailers increase sales and improve customer satisfaction.
DefCamp_2016_Chemerkin_Yury-publish.pdf - Presentation by Yury Chemerkin at DefCamp 2016 discussing mobile app vulnerabilities, data protection issues, and analysis of security levels across different types of mobile applications.
IT market in Israel, economic background, forecasts of 160 categories and the infrastructure and software products in those categories, professional services also. 710 vendors are ranked in 160 categories.
Connecting Attitudes and Social Influences with Designs for Usable Security a...Cori Faklaris
Many system designs for cybersecurity and privacy have failed to account for individual and social circumstances, leading people to use workarounds such as password reuse or account sharing that can lead to vulnerabilities. To address the problem, researchers are building new understandings of how individuals’ attitudes and behaviors are influenced by the people around them and by their relationship needs, so that designers can take these into account. In this talk, I will first share my research to connect people’s security attitudes and social influences with their security and privacy behaviors. As part of this, I will present the Security and Privacy Acceptance Framework (SPAF), which identifies Awareness, Motivation, and Ability as necessary for strengthening people’s acceptance of security and privacy practices. I then will present results from my project to trace where social influences can help overcome obstacles to adoption such as negative attitudes or inability to troubleshoot a password manager. I will conclude by discussing my current work to apply these insights to mitigating phishing in SMS text messages (“smishing”).
Lecture 8 of the IVE 2024 short course on the Pscyhology of XR.
This lecture introduced the basics of Electroencephalography (EEG).
It was taught by Ina and Matthias Schlesewsky on July 16th 2024 at the University of South Australia.
Planetek Italia is an Italian Benefit Company established in 1994, which employs 120+ women and men, passionate and skilled in Geoinformatics, Space solutions, and Earth science.
We provide solutions to exploit the value of geospatial data through all phases of data life cycle. We operate in many application areas ranging from environmental and land monitoring to open-government and smart cities, and including defence and security, as well as Space exploration and EO satellite missions.
Jacquard Fabric Explained: Origins, Characteristics, and Usesldtexsolbl
In this presentation, we’ll dive into the fascinating world of Jacquard fabric. We start by exploring what makes Jacquard fabric so special. It’s known for its beautiful, complex patterns that are woven into the fabric thanks to a clever machine called the Jacquard loom, invented by Joseph Marie Jacquard back in 1804. This loom uses either punched cards or modern digital controls to handle each thread separately, allowing for intricate designs that were once impossible to create by hand.
Next, we’ll look at the unique characteristics of Jacquard fabric and the different types you might encounter. From the luxurious brocade, often used in fancy clothing and home décor, to the elegant damask with its reversible patterns, and the artistic tapestry, each type of Jacquard fabric has its own special qualities. We’ll show you how these fabrics are used in everyday items like curtains, cushions, and even artworks, making them both functional and stylish.
Moving on, we’ll discuss how technology has changed Jacquard fabric production. Here, LD Texsol takes center stage. As a leading manufacturer and exporter of electronic Jacquard looms, LD Texsol is helping to modernize the weaving process. Their advanced technology makes it easier to create even more precise and complex patterns, and also helps make the production process more efficient and environmentally friendly.
Finally, we’ll wrap up by summarizing the key points and highlighting the exciting future of Jacquard fabric. Thanks to innovations from companies like LD Texsol, Jacquard fabric continues to evolve and impress, blending traditional techniques with cutting-edge technology. We hope this presentation gives you a clear picture of how Jacquard fabric has developed and where it’s headed in the future.
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your application performance
1. Introducing BinarySortedState: A new Flink state primitive to boost your application performance
Nico Kruber – Software Engineer
David Anderson – Community Engineering
Flink Forward 22
2. About me
Open source
● Apache Flink contributor/committer since 2016
○ Focus on network stack, usability, and performance
Career
● PhD in Computer Science at HU Berlin / Zuse Institute Berlin
● Software engineer -> Solutions architect -> Head of Solutions Architecture @ DataArtisans/Ververica (acquired by Alibaba)
● Engineering @ Immerok
About Immerok
● Building a fully managed Apache Flink cloud service for powering real-time systems at any scale
○ immerok.com
7. Use Case: Stream Sort - What's Happening Underneath

List<Event> listAtTs = events.get(ts);
if (listAtTs == null) {
    listAtTs = new ArrayList<>();
}
listAtTs.add(event);
events.put(ts, listAtTs);

[Diagram: the lookup goes to the state (RocksDB), searching the memtable + SST files for one entry; the full list, stored as byte[], passes through the de-/serializer to come back as a deserialized (Java) list.]
8. Use Case: Stream Sort - What's Happening Underneath

List<Event> listAtTs = events.get(ts);
if (listAtTs == null) {
    listAtTs = new ArrayList<>();
}
listAtTs.add(event);
events.put(ts, listAtTs);

[Diagram: on the put, the full Java list passes through the de-/serializer, and the serialized list is written as byte[] into the state (RocksDB), adding a new entry to the memtable and leaving the old one for compaction.]
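For context, here is a minimal sketch of the kind of KeyedProcessFunction the two snippets above are excerpted from. The class and field names are assumptions, and Event stands in for a POJO with a public long timestamp field; only the get/put pattern itself comes from the slides.

import java.util.ArrayList;
import java.util.List;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Sketch only: buffer events per timestamp, emit them once the watermark passes.
// Assumes: public class Event { public long timestamp; }
public class StreamSorter extends KeyedProcessFunction<String, Event, Event> {

    private transient MapState<Long, List<Event>> events;

    @Override
    public void open(Configuration parameters) {
        events = getRuntimeContext().getMapState(new MapStateDescriptor<>(
                "events", Types.LONG, Types.LIST(Types.GENERIC(Event.class))));
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<Event> out) throws Exception {
        long ts = event.timestamp;
        // The pattern from slides 7-8: full deserialize, modify, full re-serialize.
        List<Event> listAtTs = events.get(ts);
        if (listAtTs == null) {
            listAtTs = new ArrayList<>();
        }
        listAtTs.add(event);
        events.put(ts, listAtTs);
        ctx.timerService().registerEventTimeTimer(ts);
    }

    @Override
    public void onTimer(long ts, OnTimerContext ctx, Collector<Event> out) throws Exception {
        // The watermark has passed ts, so these events can now be emitted in order.
        for (Event e : events.get(ts)) {
            out.collect(e);
        }
        events.remove(ts);
    }
}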
9. Use Case: Stream Sort - Alternative Solutions
● Using MapState<Long, Event> instead of MapState<Long, List<Event>>?
○ Cannot handle multiple events per timestamp
● Using Window API?
○ Efficient event storage per timestamp
○ No really well-matching window types: sliding, tumbling, and session windows
● Using HashMapStateBackend?
○ No de-/serialization overhead
○ state limited by available memory
○ no incremental checkpoints
● Using ListState<Event> and filtering in onTimer()? (see the sketch below)
○ Reduced overhead in processElement() vs. more to do in onTimer()
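To make that last alternative concrete, a minimal sketch under the same assumptions as above (an Event POJO with a long timestamp field; buffer would be a ListState<Event> initialized in open()):

// Appends to ListState are cheap (a RocksDB merge op), but onTimer() has to
// scan and rewrite the whole buffer.
public void processElement(Event event, Context ctx, Collector<Event> out) throws Exception {
    buffer.add(event); // no read-modify-write of the full list
    ctx.timerService().registerEventTimeTimer(event.timestamp);
}

public void onTimer(long ts, OnTimerContext ctx, Collector<Event> out) throws Exception {
    List<Event> due = new ArrayList<>();
    List<Event> keep = new ArrayList<>();
    for (Event e : buffer.get()) {      // deserializes the entire buffer
        (e.timestamp <= ts ? due : keep).add(e);
    }
    due.sort(Comparator.comparingLong((Event e) -> e.timestamp));
    due.forEach(out::collect);
    buffer.update(keep);                // re-serializes what remains
}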
10. Use Case: Event-Time Stream Join

rates:
{ts: 5, code: GBP, rate: 1.20}
{ts: 10, code: USD, rate: 1.00}
{ts: 19, code: USD, rate: 1.02}

transactions:
{ts: 15, code: USD, amount: 1.00}
{ts: 10, code: GBP, amount: 2.00}
{ts: 25, code: USD, amount: 3.00}

joined result:
{ts1: 15, ts2: 10, amount: 1.00}
{ts1: 10, ts2: 5, amount: 2.40}
{ts1: 25, ts2: 19, amount: 3.06}

SELECT
  t.ts AS ts1, r.ts AS ts2,
  t.amount * r.rate AS amount
FROM transactions AS t
LEFT JOIN rates
  FOR SYSTEM_TIME AS OF t.ts AS r
  ON t.code = r.code;
14. BinarySortedState - History

2020: Started as a Hackathon project (David + Nico) on custom windowing with process functions

Nov 2021: “Temporal state” Hackathon project (David + Nico + Seth)
● Main primitives: getAt(), getAtOrBefore(), getAtOrAfter(), add(), addAll(), update()

April 2022: Created FLIP-220 and discussed it on dev@flink.apache.org
● Extended scope further to allow arbitrary user keys (not just timestamps)
○ Identified further use cases in SQL operators, e.g. min/max with retractions
● Clarified serializer requirements
● Extended the proposed API to offer range read and clear operations
● …
15. BinarySortedState - Goals
● A new keyed-state primitive, built on top of ListState
● Efficiently add to a list of values for a user-provided key
● Efficiently iterate user keys in a well-defined sort order, with native state-backend support, especially RocksDB
● Efficient operations for time-based functions (windowing, sorting, event-time joins, custom, ...)
● Operations on a subset of the state, based on user-key ranges
● Portable between state backends (RocksDB, HashMap)
(A usage sketch of the proposed API follows below.)
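As a taste of the proposed API, a minimal usage sketch. The add(), valuesAt(), readRangeUntil(), and clearEntryAt() calls appear later in this deck; the descriptor and serializer names are illustrative assumptions, since FLIP-220 was still under discussion when this talk was given:

// Hypothetical setup; the final FLIP-220 API may differ.
BinarySortedStateDescriptor<Long, Event> desc = new BinarySortedStateDescriptor<>(
        "events", lexicographicLongSerializer, eventSerializer);
BinarySortedState<Long, Event> events = getRuntimeContext().getBinarySortedState(desc);

events.add(ts, event);                      // append one value under user key ts
events.valuesAt(ts);                        // all values stored for exactly ts
events.readRangeUntil(watermark, true);     // entries with key <= watermark, in sort order
events.clearEntryAt(ts);                    // drop all values stored for ts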
17. BinarySortedState - How does it work with RocksDB?!
● RocksDB is a key-value store writing into MemTables → flushing into SST files
● SST files are sorted by the key in lexicographical binary order (byte-wise unsigned comparison)
● RocksDB offers Prefix Seek and SeekForPrev
● RocksDBMapState.RocksDBMapIterator provides efficient iteration via:
○ Fetching up to 128 RocksDB entries at once
○ RocksDBMapEntry with lazy key/value deserialization
18. BinarySortedState - LexicographicalTypeSerializer
● “Just” need to provide serializers that are compatible with RocksDB’s sort order
● Based on lexicographical binary order as defined by byte-wise unsigned comparison
● Compatible serializers extend LexicographicalTypeSerializer (encoding sketch below)

public abstract class LexicographicalTypeSerializer<T> extends TypeSerializer<T> {
    public Optional<Comparator<T>> findComparator() { return Optional.empty(); }
}
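To see what "compatible with RocksDB's sort order" means in practice, here is a self-contained sketch (illustrative only, not a Flink class) of an order-preserving byte encoding for signed longs:

import java.nio.ByteBuffer;
import java.util.Arrays;

// The kind of encoding a LexicographicalTypeSerializer for Long needs so that
// RocksDB's byte-wise unsigned comparison matches numeric order.
public class LexLongEncoding {

    // Flip the sign bit and write big-endian: negative values then sort below
    // positive ones under unsigned, byte-wise comparison.
    static byte[] encode(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value ^ Long.MIN_VALUE).array();
    }

    public static void main(String[] args) {
        long[] values = {Long.MIN_VALUE, -42L, -1L, 0L, 1L, 42L, Long.MAX_VALUE};
        for (int i = 1; i < values.length; i++) {
            // Prints true for each pair: byte order agrees with numeric order.
            System.out.println(values[i - 1] + " < " + values[i] + " : "
                    + (Arrays.compareUnsigned(encode(values[i - 1]), encode(values[i])) < 0));
        }
    }
}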
21. Stream Sort with BinarySortedState - What's Happening Underneath

events.add(ts, event);

[Diagram: the new list entry passes through the de-/serializer as a serialized entry (byte[]) and is added as a new merge op to the memtable in the state (RocksDB).]
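The cheap append relies on RocksDB's merge operator. Below is a standalone sketch in plain RocksJava (not Flink code; the path and values are arbitrary) of how appends become merge ops instead of read-modify-writes; Flink's list-style state uses the same mechanism internally:

import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.StringAppendOperator;

// With a merge operator configured, appending to a list is a single merge op
// in the memtable instead of a read-modify-write of the whole list.
public class MergeOpSketch {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        try (Options options = new Options()
                        .setCreateIfMissing(true)
                        .setMergeOperator(new StringAppendOperator());
             RocksDB db = RocksDB.open(options, "/tmp/merge-op-sketch")) {
            byte[] key = "ts=42".getBytes();
            db.merge(key, "event-1".getBytes()); // no read of the existing list
            db.merge(key, "event-2".getBytes()); // another cheap append
            // Entries are only combined on read/compaction:
            System.out.println(new String(db.get(key))); // event-1,event-2
        }
    }
}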
22. Stream Sort with BinarySortedState - What's Happening Underneath

events.valuesAt(ts).forEach(out::collect);

[Diagram: the lookup searches the memtable + SST files for all entries; the full list as byte[] passes through the de-/serializer to come back as a deserialized (Java) list.]
23. Stream Sort with BinarySortedState - What's Happening Underneath

events.clearEntryAt(ts);

[Diagram: the delete marks the k/v as deleted in the state (RocksDB); actual removal happens during compaction.]
24. Event-Time Stream Join w/ and w/out BinarySortedState (1)

With BinarySortedState:
private BinarySortedState<Long, Transaction> transactions;
private BinarySortedState<Long, Double> rate;

public void processElement1(/*...*/) {
    // ...
    if (!isLate(ts, timerSvc)) {
        // append to BinarySortedState:
        transactions.add(ts, value);
        timerSvc.registerEventTimeTimer(ts);
    }
}
// similar for processElement2()

With MapState:
private MapState<Long, List<Transaction>> transactions;
private MapState<Long, Double> rate;

public void processElement1(/*...*/) {
    // ...
    if (!isLate(ts, timerSvc)) {
        // replace list in MapState:
        addTransaction(ts, value);
        timerSvc.registerEventTimeTimer(ts);
    }
}
// similar for processElement2()
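The MapState variant calls an addTransaction() helper that the slide leaves out; presumably it is the same read-modify-write pattern as on slide 7, sketched here:

// Hypothetical helper for the MapState variant (not shown on the slide):
// the read-modify-write that BinarySortedState.add() avoids.
private void addTransaction(long ts, Transaction value) throws Exception {
    List<Transaction> listAtTs = transactions.get(ts);
    if (listAtTs == null) {
        listAtTs = new ArrayList<>();
    }
    listAtTs.add(value);
    transactions.put(ts, listAtTs); // re-serializes the whole list
}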
26. Stream Sort with BinarySortedState - Optimized
● Idea:
○ Increase efficiency by processing all events between watermarks
● Challenge:
○ Registering a timer for the next watermark will fire too often
➔ Solution:
○ Register a timer for the first unprocessed event
○ When the timer fires:
■ Process all events until the current watermark (not the timer timestamp!)
● events.readRangeUntil(currentWatermark, true)
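Putting slide 26 together, a sketch of what the optimized onTimer() could look like. readRangeUntil() comes from this deck; the returned entry type and the bulk clear are assumptions, since the FLIP-220 API was still being finalized:

public void onTimer(long timer, OnTimerContext ctx, Collector<Event> out) throws Exception {
    long watermark = ctx.timerService().currentWatermark();
    // Process all events up to the current watermark, not just the timer's timestamp.
    for (Map.Entry<Long, List<Event>> entry : events.readRangeUntil(watermark, true)) {
        entry.getValue().forEach(out::collect); // keys iterate in sort order
    }
    events.clearRangeUntil(watermark, true);    // hypothetical bulk delete
}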
27. Event-Time Stream Join with BinarySortedState - Optimized
● Idea:
○ Increase efficiency by processing all events between watermarks
● Challenge:
○ Registering a timer for the next watermark will fire too often
➔ Solution:
○ Same as Stream Sort: Timer for first unprocessed event, processing until watermark, but:
○ When the timer fires:
■ Iterate both transactions and rate (in the appropriate time range) in event-time order
32. BinarySortedState - Who will benefit?
● (Custom) stream sorting
● Time-based (custom) joins
● Code with range-reads or bulk-deletes
● Custom window implementations
● Min/Max with retractions
● …
● Basically everything maintaining a MapState<?, List<?>> or requiring range operations
33. What’s left to do?
● Iron out last bits and pieces + tests
○ Start voting thread for FLIP-220 on dev@flink.apache.org
○ Create a PR and get it merged
● Expected to land in Flink 1.17 (as experimental feature)
● Port Table/SQL/DataStream operators to improve efficiency:
○ TemporalRowTimeJoinOperator (PoC already done for validating the API ✓)
○ RowTimeSortOperator
○ IntervalJoinOperator
○ CepOperator
○ …
● Provide more LexicographicalTypeSerializers
34. Get ready to ROK!!!
Nico Kruber
linkedin.com/in/nico-kruber
nico@immerok.com