Evening Out the Uneven:
Dealing with Skews in Flink
Jun Qin, Head of Solutions Architecture
Karl Friedrich, Architect
1
Contents
01 Skew & Its Impact
02 Data Skew
03 Key Skew
04 State Skew
05 Scheduling Skew
06 Event Time/Watermark Skew
07 Key Takeaways
2
Skew
● Workload imbalance among
subtasks/TaskManagers
○ in data to be processed
○ in state size
○ in event time
○ in resource usage
■ CPU/Memory/Disk
3
● Poor resource utilization
● Back pressure
● Low throughput and/or high latency
● Potential high memory usage
○ JVM Garbage Collection
○ TM heartbeats timeout
○ Task failure
○ Job restart
Impact of Skew
4
● File Source: some files are much larger than the others
● Kafka Source: some Kafka partitions hold much
more data than the others
Examples
Data Skew
10x
5
Data Skew (Cont’d)
One task/TM is much busier than the others:
Bytes/Records Received is skewed:
Symptoms
6
● Option 1: (basic) do filter() in your pipeline as early as possible to reduce the data volume
sourceStream
.assignTimestampsAndWatermarks()
.rebalance()
.map()
.otherTransformations()
.addSink()/sinkTo()
Data Skew (Cont’d)
Solutions
● Option 3: implement a custom operator to do a Hadoop-style map-side aggregation (aka a combiner)
○ See the class MapBundleOperator in flink-table-runtime
● Option 2: re-partition data by calling
○ rebalance()
○ shuffle()
○ partitionCustom()
○ keyBy()
⇨ pay attention to the shuffle cost! (a sketch of Options 1 and 2 follows below)
7
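A minimal sketch of Options 1 and 2 combined, assuming a hypothetical stream of text lines and a placeholder filter predicate; the point is that the filter runs before the rebalance, so the network shuffle only carries records that survive it:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;

public class DataSkewFix {
    // Sketch: drop irrelevant records as early as possible (Option 1), then spread the
    // remaining records round-robin across downstream subtasks (Option 2).
    public static DataStream<Integer> evenOut(DataStream<String> skewedSource) {
        return skewedSource
                .filter(line -> !line.isEmpty())   // Option 1: filter before any shuffle (placeholder predicate)
                .rebalance()                       // Option 2: round-robin re-partitioning (adds a network shuffle)
                .map(new MapFunction<String, Integer>() {   // stand-in for the real per-record work
                    @Override
                    public Integer map(String line) {
                        return line.length();
                    }
                });
    }
}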
Data Skew (Cont’d)
Throughput improved by using rebalance()
Rebalance then consume
Consume directly
➤ If the cost of the network shuffle is significant compared to the actual data processing, you may get worse results!
8
Key Skew
Record → Key → keyGroup → taskSlot
● key = KeySelector(record) — the KeySelector determines whether there is a hot key
● keyGroup = MathUtils.murmurHash(key.hashCode()) % maxParallelism — maxParallelism determines the total number of keyGroups
● taskSlot = keyGroup * parallelism / maxParallelism — parallelism determines how the keyGroups are grouped together and assigned to taskSlots (see the sketch below)
9
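The mapping above can be tried out directly; a small sketch (the keys and the parallelism values are illustrative) that reproduces the slide’s two formulas with Flink’s own MathUtils:

import org.apache.flink.util.MathUtils;

public class KeyToSlot {
    // keyGroup = murmurHash(key.hashCode()) % maxParallelism
    // subtask  = keyGroup * parallelism / maxParallelism
    // (Flink's KeyGroupRangeAssignment in flink-runtime implements the same logic.)
    public static void main(String[] args) {
        int maxParallelism = 128;
        int parallelism = 4;
        for (String key : new String[] {"1", "2", "3", "4"}) {
            int keyGroup = MathUtils.murmurHash(key.hashCode()) % maxParallelism;
            int subtask = keyGroup * parallelism / maxParallelism;
            System.out.printf("key=%s -> keyGroup=%d -> taskSlot=%d%n", key, keyGroup, subtask);
        }
    }
}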
Key Skew (Cont’d)
Hot keys (diagram): maxParallelism=12, parallelism=4. Records map to keys A–D, keys to keyGroups 0–11, and keyGroup ranges to taskSlots. Key A alone receives 8 records, so its taskSlot gets 8 records while the others get only 1, 2, and 3:
taskSlot 1: 1 record
taskSlot 2: 8 records ⇦ hot key!
taskSlot 3: 2 records
taskSlot 4: 3 records
10
Key Skew (Cont’d)
Solutions for hot keys
● Option 1: (basic) use a different key that has less or no skew.
● Option 2: (general aggregation) local-global aggregation, similar to a Hadoop combiner
➤ Flink SQL does this automatically if table.optimizer.agg-phase-strategy = TWO_PHASE (see the sketch below)
11
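A hedged sketch of turning this on in the Table API; to our knowledge the local phase only kicks in when mini-batching is also enabled, so that is set as well (the values are illustrative):

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class TwoPhaseAgg {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Mini-batching buffers a few records so the local phase has something to pre-aggregate.
        tEnv.getConfig().getConfiguration().setString("table.exec.mini-batch.enabled", "true");
        tEnv.getConfig().getConfiguration().setString("table.exec.mini-batch.allow-latency", "1 s");
        tEnv.getConfig().getConfiguration().setString("table.exec.mini-batch.size", "5000");

        // The setting from the slide: split aggregates into a local and a global phase.
        tEnv.getConfig().getConfiguration().setString("table.optimizer.agg-phase-strategy", "TWO_PHASE");
    }
}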
Key Skew (Cont’d)
● Option 3: two-stage keyBy
Solutions for hot keys
stream
.keyBy(key)
.process(...)
stream
.keyBy(randomizedKey)
.process(...)
.keyBy(key)
.process(...)
➤ randomizedKey must be deterministic, e.g. key + hashCode(anotherField) (see the sketch below)
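A sketch of Option 3 under stated assumptions: a hypothetical Event POJO with fields key, anotherField, and count, a salt of 16 sub-keys, and short processing-time windows standing in for the process() functions on the slide. Stage 1 pre-aggregates per salted key; stage 2 merges the much smaller partial results per original key:

import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class TwoStageKeyBy {

    // Hypothetical event type used only for illustration.
    public static class Event {
        public String key;          // the (possibly hot) original key
        public String anotherField; // salts the key deterministically
        public long count;

        public Event() {}

        public Event(String key, String anotherField, long count) {
            this.key = key;
            this.anotherField = anotherField;
            this.count = count;
        }
    }

    public static DataStream<Event> countPerKey(DataStream<Event> events) {
        ReduceFunction<Event> sum = (a, b) -> new Event(a.key, a.anotherField, a.count + b.count);

        return events
                // Stage 1: deterministic "randomized" key = original key + hash of another field,
                // so a hot key is spread over up to 16 subtasks but stays stable on replay.
                .keyBy(new KeySelector<Event, String>() {
                    @Override
                    public String getKey(Event e) {
                        return e.key + "#" + Math.floorMod(e.anotherField.hashCode(), 16);
                    }
                })
                .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
                .reduce(sum)
                // Stage 2: key by the original key to merge the much smaller partial results.
                .keyBy(new KeySelector<Event, String>() {
                    @Override
                    public String getKey(Event e) {
                        return e.key;
                    }
                })
                .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
                .reduce(sum);
    }
}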
splitStream = sourceStream
.process(...)
// stream of records with hot keys
splitStream
.getSideOutput(...)
.keyBy(anotherField)
.process(...)
.keyBy(key)
.process(...)
// stream of records with non-hot keys
splitStream
.keyBy(key)
.process(...)
● Option 4: If hot keys are known in advance, split the input stream in your pipeline into a hot-key stream and a non-hot-key stream using a side output (see the sketch below).
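A sketch of Option 4, reusing the hypothetical Event type from the previous sketch; the hot-key set, the side-output tag name, and the downstream keyBy/process steps are all placeholders:

import java.util.Collections;
import java.util.Set;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class HotKeySplit {

    // Hypothetical: the hot keys are known in advance.
    private static final Set<String> HOT_KEYS = Collections.singleton("A");

    // Records with hot keys go to this side output; all other records stay on the main output.
    private static final OutputTag<TwoStageKeyBy.Event> HOT =
            new OutputTag<TwoStageKeyBy.Event>("hot-keys") {};

    public static void wire(DataStream<TwoStageKeyBy.Event> events) {
        SingleOutputStreamOperator<TwoStageKeyBy.Event> splitStream = events.process(
                new ProcessFunction<TwoStageKeyBy.Event, TwoStageKeyBy.Event>() {
                    @Override
                    public void processElement(TwoStageKeyBy.Event e, Context ctx,
                                               Collector<TwoStageKeyBy.Event> out) {
                        if (HOT_KEYS.contains(e.key)) {
                            ctx.output(HOT, e);   // hot-key stream
                        } else {
                            out.collect(e);       // non-hot-key stream
                        }
                    }
                });

        // Non-hot keys (main output): keyBy(key).process(...) as usual.
        DataStream<TwoStageKeyBy.Event> nonHot = splitStream;

        // Hot keys (side output): keyBy(anotherField).process(...) first to parallelize the work,
        // then keyBy(key).process(...) to gather the aggregated results.
        DataStream<TwoStageKeyBy.Event> hot = splitStream.getSideOutput(HOT);
    }
}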
12
Key Skew (Cont’d)
Hot keys: experimental results (CPU usage). With a single keyBy on the skewed key, one TM is bottlenecked on CPU; with the two-stage keyBy and with the stream split, CPU usage is well balanced across TMs.
13
Key Skew (Cont’d)
Hot keys: experimental results (throughput). Job throughput rose from 38K/s with a single keyBy to 42K/s with the two-stage keyBy and 50K/s with the stream split.
14
Key Skew (Cont’d)
Multiple keys are mapped to the same keyGroup (diagram): maxParallelism=12, parallelism=4. No single key is hot, but keys A and B both map to keyGroup 3, so that keyGroup — and the taskSlot it belongs to — gets most of the records:
taskSlot 1: 3 records
taskSlot 2: 8 records ⇦ hot keyGroup!
taskSlot 3: 3 records
taskSlot 4: 0 records
15
Key Skew (Cont’d)
Solution: adjust maxParallelism (key → keyGroup → taskSlot)
maxParallelism=128 (default), parallelism=3:
“A” → 104 → 2
“B” → 17 → 0 ⇦ hot keyGroup!
“C” → 17 → 0 ⇦ hot keyGroup!
Increase maxParallelism to 256, parallelism=3:
“A” → 232 → 2
“B” → 17 → 0
“C” → 145 → 1
16
Key Skew (Cont’d)
Multiple keys are mapped to the same taskSlot (diagram): maxParallelism=12, parallelism=4. No hot key and no hot keyGroup, but keys A and B land in keyGroups that belong to the same taskSlot:
taskSlot 1: 3 records
taskSlot 2: 8 records ⇦ hot taskSlot!
taskSlot 3: 3 records
taskSlot 4: 0 records
17
Key Skew (Cont’d)
Solution: adjust parallelism or maxParallelism
Diagrams: how keyGroups 0–11 are grouped into taskSlots for maxParallelism=12 with parallelism = 3, 4, 5, and 6. For example, with parallelism=3, keyGroups 0–3 go to taskSlot 0, keyGroups 4–7 to taskSlot 1, and keyGroups 8–11 to taskSlot 2.
18
Key Skew (Cont’d)
Solution: adjust parallelism or maxParallelism (key → keyGroup → taskSlot)
maxParallelism=128 (default), parallelism=4:
“1” → 54 → 1
“2” → 27 → 0
“3” → 33 → 1
“4” → 4 → 0
⇨ taskSlots 2 and 3 are idle
Reduce maxParallelism to 64, parallelism=4:
“1” → 54 → 3
“2” → 27 → 1
“3” → 33 → 2
“4” → 4 → 0
⇨ all four taskSlots are used
19
Key Skew (Cont’d)
Adjust the ratio between maxParallelism and parallelism
➤ Best practice: maxParallelism = ~5-10 x parallelism (see the sketch below for where to set both)
● maxParallelism = 128, parallelism = 127
○ 126 taskSlots each gets 1 keyGroup,
○ 1 taskSlot gets 2 keyGroups
⇨ one taskSlot processes 100% more records than each of the others
● maxParallelism = 1280, parallelism = 127
○ 117 taskSlots each gets 10 keyGroups,
○ 10 taskSlots each gets 11 keyGroups
⇨ a fluctuation of 10% among all taskSlots
20
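Where these two knobs are set in code, as a sketch (the numbers are the slide’s example, not a recommendation; changing maxParallelism of an existing job is not state-compatible without the State Processor API):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ParallelismSettings {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Job-wide defaults: keep maxParallelism roughly 5-10x the parallelism.
        env.setParallelism(127);
        env.setMaxParallelism(1280);

        // Both can also be overridden per operator on the returned stream, e.g.
        //   stream.keyBy(...).process(...).setParallelism(64).setMaxParallelism(640);
    }
}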
State Skew
● Some subtasks have much bigger state than others
● Often caused by data skew or key skew
21
Recap
Avoiding data skew, key skew, and state skew distributes records evenly among task slots.
Does that guarantee even resource utilization ❓
22
Scheduling Skew
Diagram: maxParallelism=12, parallelism=4, number of keyGroups = 12. Records are spread evenly — taskSlot 1: 3 records, taskSlot 2: 4 records, taskSlot 3: 3 records, taskSlot 4: 4 records — but the four subtasks still have to be scheduled onto TaskManager 1 and TaskManager 2.
23
Scheduling Skew (Cont’d)
Scheduling 4 subtasks onto two TMs, each with 3 slots:
● cluster.evenly-spread-out-slots: false (default) — TM1 gets 3 of the 4 subtasks, TM2 only 1: skew!
● cluster.evenly-spread-out-slots: true — the subtasks are evenly distributed between TM1 and TM2 (see the config snippet below)
24
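In flink-conf.yaml form (a cluster-level setting; per the notes it has no effect when TaskManagers are added dynamically or have only one slot each):

# flink-conf.yaml
# Prefer the least-used TaskManager when allocating slots, so subtasks
# spread across TaskManagers instead of filling up TM1 first.
cluster.evenly-spread-out-slots: true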
That’s all for
Scheduling Skew!
25
Time-based Processing
● Event time vs Processing time
○ Event time is the time when the event actually occurred
○ Processing time is the time when the event is processed
● Event time based computations
○ Windowing
○ Timer
● Apache Flink uses watermarks to keep track of the progress in event time
● A data operator may have multiple input channels in Flink
○ Keyed streams
○ Joining streams
Background Knowledge
26
Event Time/Watermark Skew
An Example
27
Diagram: an operator reads from two input channels; Channel 2’s watermark has run far ahead of Channel 1’s — the two watermarks are far apart.
Event Time/Watermark Skew (Cont’d)
Diagram: the operator runs an event-time window [0,5). While it waits for Channel 1 to reach watermark 5, the events already received from Channel 2 pile up in the operator’s state.
28
Event Time/Watermark Skew (Cont’d)
Event Time Skew:
The event distribution over time is skewed among the input channels of an operator, because of:
1. the nature of the data sources, or
2. some upstream tasks progressing faster than others.
As a result, the watermarks of subtasks or input channels may deviate from each other.
Impacts:
● Backpressure
● Large state and long checkpoint duration
● Job failures
29
Event Time/Watermark Skew (Cont’d)
Watermark alignment in Flink 1.15
30
Diagram: the same two-channel window [0,5) example, now with watermark alignment enabled.
WatermarkStrategy watermarkStrategy =
    WatermarkStrategy
        .<T>forBoundedOutOfOrderness(...)
        .withTimestampAssigner(...)
        .withWatermarkAlignment(
            watermarkGroup,
            maxAllowedWatermarkDrift,
            updateInterval);

sourceStream =
    env.fromSource(
            kafkaSource,
            watermarkStrategy,
            sourceName)
        .map(...)
        …
Consuming from the faster channel is put on hold because of maxAllowedWatermarkDrift (e.g., 1 in this case). A more concrete sketch of the snippet above follows below.
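A more concrete version, assuming a hypothetical KafkaSource of a MyEvent type with an event-time field; the Duration values and the group name are illustrative:

import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class AlignedSource {

    // Hypothetical event type carrying its own event-time timestamp.
    public static class MyEvent {
        public long eventTime;
        public String payload;
    }

    public static DataStream<MyEvent> build(StreamExecutionEnvironment env, KafkaSource<MyEvent> kafkaSource) {
        WatermarkStrategy<MyEvent> watermarkStrategy =
                WatermarkStrategy
                        .<MyEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                        .withTimestampAssigner((event, recordTs) -> event.eventTime)
                        // All sources in the same group stay within 30s of each other in event time;
                        // watermarks are reported to the JobManager once per second.
                        .withWatermarkAlignment(
                                "alignment-group-1",       // watermarkGroup
                                Duration.ofSeconds(30),    // maxAllowedWatermarkDrift
                                Duration.ofSeconds(1));    // updateInterval

        return env.fromSource(kafkaSource, watermarkStrategy, "aligned-kafka-source");
    }
}

Note that alignment only applies to FLIP-27 sources passed via fromSource(); it does not work for legacy sources or when watermarks are assigned later with assignTimestampsAndWatermarks().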
Event Time/Watermark Skew (Cont’d)
Watermark alignment reduces checkpoint size and duration
With watermark alignment
Without watermark alignment
31
Event Time/Watermark Skew (Cont’d)
What if we have to use an earlier version of Flink?
● Use JobManagerWatermarkTracker (see the sketch below)
○ Event time alignment in the Amazon Kinesis Data Streams Connector, in the Flink documentation
○ The Flink Forward 2020 talk by Shahid from Stripe: Streaming, Fast and Slow: Mitigating Watermark Skew in Large, Stateful Jobs
32
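For the Kinesis path, the hook looks roughly like this (based on the connector documentation quoted in the notes; the stream name and aggregate name are placeholders, and package names depend on the connector version):

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
import org.apache.flink.streaming.connectors.kinesis.util.JobManagerWatermarkTracker;

public class KinesisAlignedConsumer {
    public static FlinkKinesisConsumer<String> build(Properties consumerConfig) {
        FlinkKinesisConsumer<String> consumer =
                new FlinkKinesisConsumer<>("myStream", new SimpleStringSchema(), consumerConfig);

        // A global aggregate on the JobManager synchronizes per-subtask watermarks; each subtask
        // throttles how far it may emit ahead of the global watermark (WATERMARK_LOOKAHEAD_MILLIS).
        JobManagerWatermarkTracker watermarkTracker = new JobManagerWatermarkTracker("myKinesisSource");
        consumer.setWatermarkTracker(watermarkTracker);
        return consumer;
    }
}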
That’s all for
Event Time Skew!
33
✓ Skew can cause problems or failures
✓ Even out the workload not only among task
slots but also among TaskManagers
✓ maxParallelism = ~5-10 x parallelism
✓ Pay attention to the network shuffle cost
when solving skew issues
Key Takeaways
Data Skew: filter, re-partition, local aggregation
Key Skew:
○ Hot key: use a different key, local-global aggregation, stream split, two-stage keyBy
○ Hot keyGroup: adjust maxParallelism
○ Hot taskSlot: adjust parallelism or maxParallelism
State Skew: fix the underlying data skew and/or key skew
Scheduling Skew: set cluster.evenly-spread-out-slots: true
Event Time Skew: use watermark alignment (Flink 1.15+) or JobManagerWatermarkTracker
34
Happy streaming!
Q & A
35


Editor's Notes

  1. Hello everyone, welcome back to Flink Forward. When you work in the data domain, it is likely that you face some skew issues as data are not always evenly distributed. This is the same when you do stream processing with Flink. Skews can result in wasted resources and limited scalability. So, how can we even out the uneven? My name is Jun, head of solutions architecture in Ververica. In the past years, our team have helped customers and users solve various skew-related issues in their Flink jobs or clusters. Today, together with my colleague Karl, we will present the various skew situations that users often run into and discuss the solutions for each of them. We hope this can serve as a guideline to help you reduce skew in your Flink environment.
  2. We will start with the definition of skew and its impact on Flink jobs. Then I will present data skew & key skew. My colleague Karl will continue with state skew, scheduling skew, and event time or watermark skew. We will also show experimental results for some of the solutions. At the end, we will summarize the talk with the key takeaways.
  3. Let us first define skew. Skew means the workload imbalance among subtasks of a Flink job or TaskManagers of a Flink cluster. Skew can happen <CLICK> in terms of data to be processed, <CLICK> in terms of state size, <CLICK> in terms of event time, <CLICK> or in terms of CPU/Memory/Disk usage.
  4. As we saw in the previous slide, in a skew situation, some task managers may be 100% busy while the others do not even reach 50%. This leads to poor resource utilization. In a situation where the majority of the data is processed by a single task slot, it will cause backpressure to upstream operators, resulting in low throughput and/or high latency. When you have a skew in event time, you may need to buffer lots of events in state, which can cause high memory usage and long garbage collection times. This can then lead to TM heartbeat timeouts, task failures, and job restarts.
  5. The first type of skew is Data Skew. For example, your job consumes messages from a directory of files where some files are much larger than other files. A similar thing can happen when your job consumes from a Kafka topic where the topic has several partitions, but some partitions have much more data than other partitions.
  6. What you will see in the Flink UI is that the number of records received by some subtasks is much larger than by other subtasks. <CLICK> Consequently, that task is much busier than the others. And the corresponding TaskManager also has higher CPU usage than the others. How can we deal with this situation in Flink?
  7. The first basic option is to check whether some records can be filtered out at the beginning of your pipeline. If so, you can then reduce the data volume sent to downstream operators. Then skew will be less of an issue. <CLICK> The second option is to re-partition data among the subtasks of the operators following the source operator. You can call rebalance() to distribute records in a round-robin fashion, or shuffle() to select downstream operator subtasks randomly. You can also supply your own partitioning scheme, or call keyBy() if you are in a keyed context. Obviously, data re-partitioning implies a network shuffle. So you will get a performance improvement only if the network shuffle cost is insignificant compared to the computation of the rest of the pipeline. Typically, we suggest re-partitioning only if some TaskManager reaches 100% CPU usage because of the skew. <CLICK> The third option is to implement a custom operator to do a Hadoop-style map-side aggregation. The purpose here is to reduce the overall workload of downstream operators. For example, instead of sending 10 raw records downstream, you can send 1 record with the aggregated value. You can see such an example in the MapBundleOperator class in the flink-table-runtime package.
  8. Here is an experiment we conducted. We have 4 subtasks consuming directly from a Kafka topic with 4 partitions where one of the partitions contains 80% of the total data volume. The throughput of our job was 30K records per second. After we re-partitioned the data with rebalance(), we increased the job throughput to 40K. <CLICK> But as mentioned before, if the network shuffle cost is significant compared to the actual data processing, you may get worse performance with data re-partitioning.
  9. As mentioned in the data skew section, we can use keyBy to re-partition data to solve a data skew issue. But if the data are not evenly distributed among keys, it then becomes a key skew issue. Before we deep dive into key skew issues, let us first have a look at how Flink maps records to keys, and then to keyGroups and taskSlots. Flink maps records to keys by the KeySelector. So the KeySelector determines the number of records that are mapped to a particular key. If the number is large, the key becomes a hot key. To compute keyGroups, Flink applies murmurHash to the key’s hashCode, modulo maxParallelism. So, maxParallelism determines the total number of keyGroups. For a given parallelism, all keyGroups are split into several ranges, and each range is assigned to a taskSlot. So, parallelism here determines how the keyGroups are grouped together and mapped to taskSlots.
  10. Let us look at a concrete example. Here, every record is mapped to a key. 8 of them are mapped to key A. Key A is mapped to keyGroup 3. Given a maxParallelism of 12 and a parallelism of 4, all 12 keyGroups are split into 4 ranges, represented by orange/yellow/green/red. keyGroup 3 is mapped to taskSlot 2. This means taskSlot 2 will process 8 records, while the other taskSlots only get 1/2/3 records. <CLICK> This is the Hot Key issue.
  11. The first basic solution to the hot key issue is to use a different key that has less or no skew. For example, instead of keyBy currency, you can keyBy accountId. If you are doing general aggregation, you can try the local-global aggregation approach, aka two-phase aggregation. This is similar to a Hadoop combiner. As shown in the picture here, instead of keyBy color directly, you first aggregate locally in each subtask. The local aggregation can help to accumulate a certain amount of input records which have the same key into a single accumulator. The global aggregation will then receive the reduced accumulators instead of a large number of raw input records. This can significantly reduce the network shuffle and make the key skew less of an issue. Flink SQL does this automatically if you enable the TWO_PHASE aggregation strategy.
  12. The third approach to the hot key issue is two-stage keyBy. Because the key is skewed, we first keyBy a randomized key that consists of the original key and a random part. The assumption here is that the amount of the output data from the first keyBy(randomizedKey) and its process function is significantly reduced in comparison to the amount of original input data. Then in the second step, we can keyBy the original key. Because of the reduced data volume, the hot key is no longer an issue. One thing to note here is that, given an input record, the randomized key must be deterministic. For example, you can use the original key plus the hashCode of another field of the input record. <CLICK> If you know the hot keys in advance, another approach to solve the hot key issue is to split the input stream in your pipeline into streams of hot keys and streams of other keys, by using Flink’s side output. For the streams of other keys, you do as usual with your keyBy. For the streams of hot keys, you can keyBy another field to parallelize the data processing and then keyBy the original key to gather the aggregated results.
  13. Here, we simulated a hot key in our Kafka topic and tested the two-keyBy solution and the stream split solution. As we can see on the left-hand side, due to the existence of a hot key, one of the TMs is bottlenecked on CPU if we just keyBy with the original key. The CPU usage is well balanced in the two-keyBy solution and the stream split solution, as seen on the right-hand side.
  14. As a result, the job throughput is increased from 38k to 42K in the two-keyBy approach and to 50K in the stream split approach. <PAUSE> This is all about the hot key issue.
  15. Let us go back to the original picture. Here, we do not have a hot key, because each key is associated with 3-4 records. But both key A and key B are mapped to keyGroup 3. For a similar reason as mentioned before, taskSlot 2 gets the majority of the data to be processed. This is a hot keyGroup issue because multiple keys are mapped to the same keyGroup. The solution here is to adjust maxParallelism.
  16. Let us look at a concrete example. With the default maxParallelism of 128, if you want to process records with keys of string A, B, C in your pipeline with a parallelism of 3, key B and C will be mapped to keyGroup 17 and processed by taskSlot 0, key A will be processed by taskSlot 2, taskSlot 1 is idle. If you change the maxParallelism to 256, the three keys are evenly distributed to three taskSlots
  17. There is another scenario. Here we have no hot keys and no hot keyGroups, but both key A and key B are mapped to taskSlot 2, so taskSlot 2 is hot again. The solution for this particular example is to adjust the parallelism.
  18. For example, given a maxParallelism of 12, keyGroups, represented by the numbers, are grouped and mapped to taskSlots based on colors. This slide shows how the keyGroups are distributed among taskSlots by adjusting the parallelism. <MOUSE> For example, when parallelism=3, keyGroups 0-3 are mapped to taskSlot 0, keyGroups 4-7 are mapped to taskSlot 1, and keyGroups 8-11 are mapped to taskSlot 2.
  19. You can also adjust maxParallelism to achieve an even distribution. With the default maxParallelism of 128, if you want to process records with keys of string 1,2,3,4 in your pipeline with a parallelism of 4, key 1 and 3 are processed by taskSlot 1, key 2 and 4 are processed by taskSlot 0, taskSlot 2 & 3 are idle. If you change the maxParallelism to 64, the four keys are evenly distributed to 4 taskSlots
  20. When changing maxParallelism and parallelism, you should also pay attention to the ratio between maxParallelism and parallelism, as it can also impact the data distribution among taskSlots. With the default maxParallelism of 128 and a parallelism of 127, most of the taskSlots will get one keyGroup each, but there is one taskSlot that gets two keyGroups. This taskSlot will get 100% more workload compared with the other taskSlots. If we now change the maxParallelism to 1280, then all of the taskSlots will get either 10 or 11 keyGroups, meaning there is a fluctuation of 10% workload among all taskSlots. The workload is more evenly distributed in this example than in the previous example. So the best practice is to set maxParallelism to 5-10 times the parallelism. Keep in mind that setting the maximum parallelism to a very large value can negatively impact performance because state backends have to keep internal data structures that scale with the number of key-groups. Also because of this, if you want to change the maxParallelism of an existing job without discarding the state, you need to convert your state via the State Processor API. With this, I am now handing over to my colleague Karl for the other types of skew. <FINISH>
  21. Thanks! The 3rd kind of skew is “State Skew”. It refers to the case when some subtasks have much bigger state than others. In this example, subtask 2’s state is much bigger than the others’ and takes much longer to checkpoint. State Skew is often caused by data skew or key skew, which we discussed earlier. And typically it can be solved by a combination of solutions to data skew or key skew, depending on the situation. [Skip, Audience Qs] KQ: Is the ID a TM ID or a slot? Flink UI, click an operator: Subtask ID, which is mapped to a TaskSlot. An operator can have many instances (tasks/subtasks). Each subtask is scheduled to a TaskSlot. But it’s not always a 1-to-1 mapping. E.g. a TaskSlot can run an operator chain of 4 ops, i.e. 4 subtasks are scheduled to 1 TaskSlot.
  22. So far we’ve discussed data skews, key skews and state skews. They are very important. And Avoiding them helps us distribute records among task slots evenly. So, Does it guarantee even resource utilization? Unfortunately, no.
  23. Suppose we have our data evenly distributed in key groups and then into 4 subtasks. (Everything looks nice.) <CLICK> Next, the subtasks will be scheduled in TaskManagers. <NEXT> [Skip] For example, with maxParallelism=128 (default), parallelism=4: 1 → 86 → 2 2 → 127 → 3 3 → 113 → 3 4 → 7 → 0 5 → 126 → 3
  24. Let’s assume that we have 2 TaskManagers, which have 3 task slots each, and we have 4 subtasks to schedule. By default, Flink would schedule 3 subtasks to TaskManager 1 and the remaining subtask to TaskManager 2. Now we have TaskManager 1 doing 75% of the work, while TaskManager 2 is doing 25%. If this continues, we are not using our resources evenly. We call this Scheduling Skew, the 4th kind of skew. What can we do? <CLICK> We can turn on the Flink option cluster.evenly-spread-out-slots. <CLICK> Now the scheduler will take slots from the least used TM (when there aren’t any other preferences). This way the subtasks will be scheduled evenly among TaskManagers. <Note> Please note: This example assumes that both TM1 and TM2 are registered with the cluster before the jobs are submitted. And in clusters that dynamically add TMs as needed, the cluster.evenly-spread-out-slots option doesn’t make sense. In addition, if each TM has only one slot (common in K8s environments), then this configuration has no effect.
  25. That’s all for Scheduling Skew! <Pause> Next, we’re gonna look at a very different kind of skew.
  26. But before that, please allow me to refresh on some background knowledge first. In the domain of stream processing, we have the notions of Event time and Processing time. In general, Event time is the time when the event actually occurred, determined by the timestamp on the data record; While Processing time is the time when the event is processed, determined by the clock on the system processing the record. Many use cases apply window algorithms and timers based on Event time. And Apache Flink uses watermarks to keep track of the progress in event time. In addition, A data operator may have multiple input channels in Flink.
  27. Let’s look at this scenario. We have an Op processing 2 channels. And we indicate the watermarks from the channels by numbers. On the right side, When the Op receives Watermark 5 from channel 2, the greatest watermark it has received from channel 1 is Watermark 1. In other words, channel 2 progresses faster than channel 1 at the moment. (The greatest available watermark in channel 2 is 9, while the greatest in channel 1 is 5.) [Ref] Content of channels may be from: different streams, network shuffle, or keyBy(). keyBy is a type of network shuffle
  28. If the Op does an event time based computation, like a window operation, data from channel 2 will pile up (in the Op) when the Op waits for data from channel 1. This is called Event Time Skew.
  29. Event Time Skew happens when The event distribution on time is skewed among input channels of an operator. It could be because: Of the nature of the data sources (e.g. we dedup or join files of various sizes, each source takes 1 file at a time as input) Or because some upstream tasks progress faster than others; And resulting in varied event time and watermarks from different input channels of the operator in concern. <Pause> Event Time Skew may result in: Backpressure Large state and long checkpoint duration And Job failures <Pause> What to do? <NEXT> [Ref] How to detect Watermark Skew? How mitigating event-time skewness can reduce checkpoint failures and task manager crashes Watch the lag of the assigned Kafka partitions per Flink Kafka consumer. Unfortunately, these metrics are not available. Need to draw your own conclusions based on some indirect indicators, such as looking into whether the total checkpointing times increase faster than the state size, or whether there are differences in the checkpointing acknowledgement times between the various instances of the stateful operators. The latter may be partly because your data is not evenly distributed. Maybe the most reliable indicator is an irregular watermark progression of the Flink Kafka Consumer instances. Flink UI, click an operator: Subtask ID, which is mapped to TaskSlot. An operator can have many instances (tasks/subtasks). Each subtask is scheduled to a TaskSlot. But it’s not always 1-to-1 mapping. E.g. a TaskSlot can run an Operator chain of 4 Ops, i.e. 4 subtasks are scheduled to 1 TaskSlot. For the relationship of records, KeyGroups, and TaskSlots, see slide 23 <Scheduling Skew>. { After using keyBy(), we cannot see the watermark skews in Flink UI, as it shows the smallest watermark of all input channels of each subtask. In this demo, each subtask processes its own Kafka partition. They don’t interact with each other (no network shuffle involved). } Backpressure: At a high level, backpressure happens if some operator(s) in the Job Graph cannot process records at the same rate as they are received. This fills up the input buffers of the subtask that is running this slow operator. Once the input buffers are full, backpressure propagates to the output buffers of the upstream subtasks. Once those are filled up, the upstream subtasks are also forced to slow down their records’ processing rate to match the processing rate of the operator causing this bottleneck down the stream. Backpressure further propagates up the stream until it reaches the source operators.
  30. We can use Watermark alignment in Flink 1.15. With Watermark alignment, Flink pauses consuming from sources/tasks which generated watermarks that are too far into the future. Meanwhile it continues reading records from other sources/tasks which can move the combined watermark forward and thus unblock the faster ones. Look at the diagram on the right. We can use The parameter maxAllowedWatermarkDrift to define the maximal watermark difference between the 2 channels for the operator. In this example, if we set maxAllowedWatermarkDrift to 1, then when the watermark from Channel 1 reaches 5, the op will consume no more events from Channel 2 after Channel 2’s watermark reaches 6. It will continue consuming from Channel 2 after Channel 1’s watermark increases. This way the op doesn’t need to buffer excessive data, and the Event Time Skew is mitigated. The cost is more RPC messages between TMs and the JM. <NOTE> Please note that It Only works with sources which are implemented with the new source interface (FLIP-27) It does not work if timestamps and watermarks have been assigned to source before applying watermark alignment. Let’s look at some results. <NEXT> [Ref] https://nightlies.apache.org/flink/flink-docs-release-1.15/api/java/org/apache/flink/api/common/eventtime/WatermarkStrategy.html#withWatermarkAlignment-java.lang.String-java.time.Duration-java.time.Duration- @Experimental default WatermarkStrategy<T> withWatermarkAlignment(String watermarkGroup, java.time.Duration maxAllowedWatermarkDrift, java.time.Duration updateInterval) Creates a new WatermarkStrategy that configures the maximum watermark drift from other sources/tasks/partitions in the same watermark group. The group may contain completely independent sources (e.g. File and Kafka). Once configured Flink will "pause" consuming from a source/task/partition that is ahead of the emitted watermark in the group by more than the maxAllowedWatermarkDrift. Parameters: watermarkGroup - A group of sources to align watermarks maxAllowedWatermarkDrift - Maximal drift, before we pause consuming from the source/task/partition updateInterval - How often tasks should notify coordinator about the current watermark and how often the coordinator should announce the maximal aligned watermark. https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/dev/datastream/event-time/generating_watermarks/#watermark-alignment-_beta_ Note: You can enable watermark alignment only for FLIP-27 sources. It does not work for legacy or if applied after the source via DataStream#assignTimestampsAndWatermarks. When enabling the alignment, you need to tell Flink, which group should the source belong. You do that by providing a label (e.g. alignment-group-1) which bind together all sources that share it. Moreover, you have to tell the maximal drift from the current minimal watermarks across all sources belonging to that group. The third parameter describes how often the current maximal watermark should be updated. The downside of frequent updates is that there will be more RPC messages travelling between TMs and the JM. In order to achieve the alignment Flink will pause consuming from the source/task, which generated watermark that is too far into the future. In the meantime it will continue reading records from other sources/tasks which can move the combined watermark forward and that way unblock the faster one. Note: As of 1.15, Flink supports aligning across tasks of the same source and/or different sources. 
It does not support aligning splits/partitions/shards in the same task. In a case where there are e.g. two Kafka partitions that produce watermarks at different pace, that get assigned to the same task watermark might not behave as expected. Fortunately, worst case it should not perform worse than without alignment. Given the limitation above, we suggest applying watermark alignment in two situations: You have two different sources (e.g. Kafka and File) that produce watermarks at different speeds You run your source with parallelism equal to the number of splits/shards/partitions, which results in every subtask being assigned a single unit of work. “if timestamps and watermarks have been assigned to source”, what does it mean? Those are defined by the watermarkStrategy parameter of StreamExecutionEnvironment.fromSource(). [Skip] Does not work with: watermarkStrategy.withIdleness()
  31. This is the data collected for a Flink job having Event Time skew. The job consumes a Kafka topic of 4 partitions, and the watermarks (event times) of 1 partition are much lower than those of the other partitions. And it processes data in a window operator with Event Time. The left graph shows the checkpoint size with and without using watermark alignment. And the right graph shows the checkpoint duration with and without using watermark alignment. After using watermark alignment, both checkpoint size and duration were reduced significantly, as the Event Time skew is mitigated. [Skip] KQ5: how did you draw the data of 2 jobs in 1 diagram? Yes draw the data of 2 jobs, in Grafana.
  32. What if we have to use an earlier version of Flink? <Pause> We can use JobManagerWatermarkTracker. <Pause> For example, If you use Amazon Kinesis Data Streams, check out Event time alignment in Amazon Kinesis Data Streams Connector. Otherwise, the talk <Streaming, Fast and Slow>, from Flink Forward 2020 might help you. Both solutions use JobManagerWatermarkTracker. Essentially we use a global aggregate to synchronize per subtask watermarks. Each subtask uses a per shard queue to control the rate at which records are emitted downstreams (based on how far ahead of the global watermark the next record in the queue is. ) [Ref] Event time alignment The Flink Kinesis Consumer optionally supports synchronization between parallel consumer subtasks (and their threads) to avoid the event time skew related problems described in Event time synchronization across sources. To enable synchronization, set the watermark tracker on the consumer: JobManagerWatermarkTracker watermarkTracker = new JobManagerWatermarkTracker("myKinesisSource"); consumer.setWatermarkTracker(watermarkTracker); The JobManagerWatermarkTracker uses a global aggregate to synchronize per subtask watermarks. Each subtask uses a per shard queue to control the rate at which records are emitted downstream based on how far ahead of the global watermark the next record in the queue is. The “emit ahead” limit is configured via ConsumerConfigConstants.WATERMARK_LOOKAHEAD_MILLIS. Smaller values reduce the skew but also the throughput. Larger values will allow the subtask to proceed further before waiting for the global watermark to advance. Another variable in the throughput equation is how frequently the watermark is propagated by the tracker. The interval can be configured via ConsumerConfigConstants.WATERMARK_SYNC_MILLIS. Smaller values reduce emitter waits and come at the cost of increased communication with the job manager. Since records accumulate in the queues when skew occurs, increased memory consumption needs to be expected. How much depends on the average record size. With larger sizes, it may be necessary to adjust the emitter queue capacity via ConsumerConfigConstants.WATERMARK_SYNC_QUEUE_CAPACITY. [Skip] Or you can implement a rate limiter. How do you rate-limit? Per volume? How to rate-limit per event time?
  33. And that concludes our discussion of Event Time Skew.
  34. In this talk, we covered 5 kinds of skew, all of which may impact the performance and scalability of your systems significantly. For Data Skew, we may filter unnecessary data, re-partition data, or do local aggregation. For Hot Key, we may use a different key, use local-global aggregation, use two-stage keyBy, or use stream split. For Hot keyGroup, we may adjust maxParallelism. For Hot taskSlot, we may adjust parallelism or maxParallelism. For State Skew, we may find the root cause and fix the underlying data skew and/or key skew. For Scheduling Skew, we may set cluster.evenly-spread-out-slots to true. For Event Time Skew, we may leverage watermark alignment or the JobManager Watermark Tracker. <Pause> In principle, we want to even out the workload among not only task slots but also TaskManagers; and look out for Event Time Skew; but pay attention to the cost of your solutions, <Pause> like network shuffle. <Pause> All of those were learned from years of experience with large-scale distributed systems. And we humbly hope that they can help you build performant and scalable systems! And happy streaming! [Skip] Workitems TODOs! FF week support?
  35. And happy streaming!