Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/2rtxaMm.
Tyler Akidau explores the relationship between the Beam Model and stream & table theory. He explains what is required to provide robust stream processing support in SQL and discusses concrete efforts that have been made in this area by the Apache Beam, Calcite, and Flink communities, compared to other offerings such as Apache Kafka’s KSQL and Apache Spark’s Structured Streaming. Filmed at qconlondon.com.
Tyler Akidau is a senior staff software engineer at Google, where he is the technical lead for the Data Processing Languages & Systems group, responsible for Google's Apache Beam efforts, Google Cloud Dataflow, and internal data processing tools like Google Flume, MapReduce, and MillWheel. He is also a founding member of the Apache Beam PMC.
Streaming SQL Foundations: Why I ❤ Streams+Tables
1. 1
Foundations of streaming SQL
or: how I learned to love stream & table theory
Slides: https://s.apache.org/streaming-sql-qcon-london
Tyler Akidau
Apache Beam PMC
Software Engineer at Google
@takidau
Covering ideas from across the Apache Beam, Apache Calcite, Apache Kafka, and Apache Flink communities, with
thoughts and contributions from Julian Hyde, Fabian Hueske, Shaoxuan Wang, Kenn Knowles, Ben Chambers, Reuven
Lax, Mingmin Xu, James Xu, Martin Kleppmann, Jay Kreps and many more, not to mention that whole database
community thing...
QCon London 2018
2. InfoQ.com: News & Community Site
Watch the video with slide synchronization on InfoQ.com!
https://www.infoq.com/presentations/sql-streaming
Presented at QCon London
www.qconlondon.com
4. 2
Table of Contents
01 Stream & Table Theory (Chapter 7)
A Basics
B The Beam Model
02 Streaming SQL (Chapter 9)
A Time-varying relations
B SQL language extensions
5. 3
01 Stream & Table Theory
TFW you realize everything you do was invented by the database community decades ago...
A Basics
B The Beam Model
7. 5
Special theory of stream & table relativity
streams → tables: The aggregation of a stream of updates over time yields a table.
tables → streams: The observation of changes to a table over time yields a stream.
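The two conversions can be illustrated with a minimal sketch in plain Java (no Beam dependency). The class and method names here are hypothetical and not part of the talk; the score updates mirror the UserScores data used later in the deck. Aggregating the updates yields a table of per-user totals, and recording each change to that table yields a stream.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; none of these names come from Beam or the slides.
class StreamTableSketch {
    // streams -> tables: aggregate a stream of (user, score) updates into per-user totals.
    static Map<String, Integer> aggregate(List<Map.Entry<String, Integer>> updates) {
        Map<String, Integer> table = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> u : updates) {
            table.merge(u.getKey(), u.getValue(), Integer::sum);
        }
        return table;
    }

    // tables -> streams: observe every change to the table and emit it as a stream element.
    static List<String> changelog(List<Map.Entry<String, Integer>> updates) {
        Map<String, Integer> table = new LinkedHashMap<>();
        List<String> stream = new ArrayList<>();
        for (Map.Entry<String, Integer> u : updates) {
            int total = table.merge(u.getKey(), u.getValue(), Integer::sum);
            stream.add(u.getKey() + "=" + total);
        }
        return stream;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> updates = List.of(
            Map.entry("Julie", 7), Map.entry("Frank", 3),
            Map.entry("Julie", 1), Map.entry("Julie", 4));
        System.out.println(aggregate(updates)); // {Julie=12, Frank=3}
        System.out.println(changelog(updates)); // [Julie=7, Frank=3, Julie=8, Julie=12]
    }
}
```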
10. 8
The Beam Model
What results are calculated?
Where in event time are results calculated?
When in processing time are results materialized?
How do refinements of results relate?
11. 9
Reconciling streams & tables w/ the Beam Model
● How does batch processing fit into all of this?
● What is the relationship of streams to bounded and
unbounded datasets?
● How do the four what, where, when, how questions map
onto a streams/tables world?
28. 26
Reconciling streams & tables w/ the Beam Model
● How does batch processing fit into all of this?
1. Tables are read into streams.
2. Streams are processed into new streams until a
grouping operation is hit.
3. Grouping turns the stream into a table.
4. Repeat steps 1-3 until you run out of operations.
29. 27
Reconciling streams & tables w/ the Beam Model
● What is the relationship of streams to bounded and
unbounded datasets?
Streams are the in-motion form of data,
both bounded and unbounded.
37. 35
Where in event time?
Windowing divides data into event-time-based finite chunks.
Often required when doing aggregations over unbounded data.
[Figure: fixed, sliding, and session windows, shown per key (Key 1, Key 2, Key 3) along a time axis.]
38. 36
Where in event time?
PCollection<KV<User, Score>> input = IO.read(...)
    .apply(ParDo.of(new ParseFn()))
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))))
    .apply(Sum.integersPerKey());
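What fixed two-minute windowing does to a per-key sum can be sketched in plain Java, without the Beam SDK. This is only an illustration under stated assumptions: event times are given as minutes since midnight (12:01 is 721), and the class and method names are hypothetical. Each element is assigned to the window starting at the largest multiple of the window size not after its event time, and sums are then kept per (user, window) pair.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of fixed-window assignment; not Beam's implementation.
class FixedWindowSketch {
    static final int WINDOW_MINUTES = 2;

    // Start of the fixed window containing the given event time (minutes since midnight).
    static int windowStart(int eventMinute) {
        return eventMinute - (eventMinute % WINDOW_MINUTES);
    }

    // Sum scores per (user, window start); keys look like "Julie@720".
    static Map<String, Integer> sumPerKeyAndWindow(String[] users, int[] scores, int[] eventMinutes) {
        Map<String, Integer> sums = new TreeMap<>();
        for (int i = 0; i < users.length; i++) {
            String key = users[i] + "@" + windowStart(eventMinutes[i]);
            sums.merge(key, scores[i], Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        String[] users = {"Julie", "Frank", "Julie", "Julie"};
        int[] scores = {7, 3, 1, 4};
        int[] eventMinutes = {721, 723, 723, 727}; // 12:01, 12:03, 12:03, 12:07
        // {Frank@722=3, Julie@720=7, Julie@722=1, Julie@726=4}
        System.out.println(sumPerKeyAndWindow(users, scores, eventMinutes));
    }
}
```

Without windowing, Julie's three scores would collapse into a single total; with it, grouping gains the time dimension and each window holds its own sum.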
41. 39
When in processing time?
• Triggers control when results are emitted.
• Triggers are often relative to the watermark.
[Figure: processing time plotted against event time, showing the ideal line, the approximate watermark (~Watermark), and the skew between them.]
42. 40
When in processing time?
PCollection<KV<User, Score>> input = IO.read(...)
    .apply(ParDo.of(new ParseFn()))
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AtWatermark()))
    .apply(Sum.integersPerKey());
45. 43
How do refinements relate?
PCollection<KV<User, Score>> input = IO.read(...)
    .apply(ParDo.of(new ParseFn()))
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AtWatermark().withLateFirings(AtCount(1)))
        .accumulatingFiredPanes())
    .apply(Sum.integersPerKey());
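How successive panes for one window relate under different accumulation modes can be shown with a small plain-Java sketch. It is a simplification and not Beam's implementation: the class and method names are hypothetical, and the trigger firings are assumed to be given up front as batches of inputs (one on-time batch followed by one late element). In accumulating mode each pane reflects everything seen so far; in discarding mode each pane reflects only the inputs since the previous firing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of accumulating vs. discarding panes for a single window.
class AccumulationSketch {
    // Each inner list holds the inputs that arrived between two trigger firings.
    static List<Integer> panes(List<List<Integer>> inputsPerFiring, boolean accumulating) {
        List<Integer> emitted = new ArrayList<>();
        int state = 0;
        for (List<Integer> inputs : inputsPerFiring) {
            for (int value : inputs) {
                state += value;
            }
            emitted.add(state);
            if (!accumulating) {
                state = 0; // discarding mode forgets data that has already been emitted
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        // On-time firing sees 7 and 1; a late element 4 causes one more firing.
        List<List<Integer>> firings = List.of(List.of(7, 1), List.of(4));
        System.out.println(panes(firings, true));  // [8, 12]
        System.out.println(panes(firings, false)); // [8, 4]
    }
}
```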
49. 47
General theory of stream & table relativity
Pipelines : tables + streams + operations
Tables : data at rest
Streams : data in motion
Operations : (stream | table) → (stream | table) transformations
● stream → stream: Non-grouping (element-wise) operations
Leaves stream data in motion, yielding another stream.
● stream → table: Grouping operations
Brings stream data to rest, yielding a table.
Windowing adds the dimension of time to grouping.
● table → stream: Ungrouping (triggering) operations
Puts table data into motion, yielding a stream.
Accumulation dictates the nature of the stream (deltas, values, retractions).
● table → table: (none)
Impossible to go from rest and back to rest without being put into motion.
51. 49
Relational algebra
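Relation: UserScores
-------------------------
| User  | Score | Time  |
-------------------------
| Julie | 7     | 12:01 |
| Frank | 3     | 12:03 |
| Julie | 1     | 12:03 |
| Julie | 4     | 12:07 |
-------------------------
Relational algebra: π_{Score,Time}(UserScores)
-----------------
| Score | Time  |
-----------------
| 7     | 12:01 |
| 3     | 12:03 |
| 1     | 12:03 |
| 4     | 12:07 |
-----------------
SQL: SELECT Score, Time FROM UserScores;
-----------------
| Score | Time  |
-----------------
| 7     | 12:01 |
| 3     | 12:03 |
| 1     | 12:03 |
| 4     | 12:07 |
-----------------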
52. 50
Relations evolve over time
12:00> SELECT * FROM UserScores;
-------------------------
| Name  | Score | Time  |
-------------------------
-------------------------
12:01> SELECT * FROM UserScores;
-------------------------
| Name  | Score | Time  |
-------------------------
| Julie | 7     | 12:01 |
-------------------------
12:03> SELECT * FROM UserScores;
-------------------------
| Name  | Score | Time  |
-------------------------
| Julie | 7     | 12:01 |
| Frank | 3     | 12:03 |
| Julie | 1     | 12:03 |
-------------------------
12:07> SELECT * FROM UserScores;
-------------------------
| Name  | Score | Time  |
-------------------------
| Julie | 7     | 12:01 |
| Frank | 3     | 12:03 |
| Julie | 1     | 12:03 |
| Julie | 4     | 12:07 |
-------------------------
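The idea that a relation evolves over time can be sketched as a function from a point in time to a classic relation. The plain-Java example below is an illustration only: the class and method names are hypothetical, and times are assumed to be zero-padded "HH:mm" strings so that string comparison matches chronological order. Evaluating the same query at different times returns different snapshots of UserScores.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: a relation viewed as a function of time.
class TimeVaryingRelationSketch {
    // Rows of UserScores as {name, score, time}.
    static final String[][] USER_SCORES = {
        {"Julie", "7", "12:01"},
        {"Frank", "3", "12:03"},
        {"Julie", "1", "12:03"},
        {"Julie", "4", "12:07"},
    };

    // The classic relation observed at the given time: every row written at or before it.
    static List<String> snapshotAt(String time) {
        List<String> rows = new ArrayList<>();
        for (String[] row : USER_SCORES) {
            if (row[2].compareTo(time) <= 0) {
                rows.add(String.join(" | ", row));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String time : new String[] {"12:00", "12:01", "12:03", "12:07"}) {
            System.out.println(time + "> " + snapshotAt(time));
        }
    }
}
```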
53. 51
Classic SQL vs Streaming SQL
Classic SQL :: classic relations :: single point in time
Streaming SQL :: time-varying relations :: every point in time
61. 59
Time-varying relations: tables
12:07> SELECT Name, SUM(Score), MAX(Time) FROM UserScores GROUP BY Name;
-----------------------------------------------------------------------------------------------------------------
| [-inf, 12:01) | [12:01, 12:03) | [12:03, 12:07) | [12:07, now) |
| ------------------------- | ------------------------- | ------------------------- | ------------------------- |
| | Name | Score | Time | | | Name | Score | Time | | | Name | Score | Time | | | Name | Score | Time | |
| ------------------------- | ------------------------- | ------------------------- | ------------------------- |
| | | | | | | Julie | 7 | 12:01 | | | Julie | 8 | 12:03 | | | Julie | 12 | 12:07 | |
| | | | | | | | | | | | Frank | 3 | 12:03 | | | Frank | 3 | 12:03 | |
| ------------------------- | ------------------------- | ------------------------- | ------------------------- |
-----------------------------------------------------------------------------------------------------------------
12:00> SELECT TABLE Name, SUM(Score), MAX(Time) FROM UserScores GROUP BY Name;
-------------------------
| Name  | Score | Time  |
-------------------------
-------------------------
12:01> SELECT TABLE Name, SUM(Score), MAX(Time) FROM UserScores GROUP BY Name;
-------------------------
| Name  | Score | Time  |
-------------------------
| Julie | 7     | 12:01 |
-------------------------
12:03> SELECT TABLE Name, SUM(Score), MAX(Time) FROM UserScores GROUP BY Name;
-------------------------
| Name  | Score | Time  |
-------------------------
| Julie | 8     | 12:03 |
| Frank | 3     | 12:03 |
-------------------------
12:07> SELECT TABLE Name, SUM(Score), MAX(Time) FROM UserScores GROUP BY Name;
-------------------------
| Name  | Score | Time  |
-------------------------
| Julie | 12    | 12:07 |
| Frank | 3     | 12:03 |
-------------------------
62. 60
Time-varying relations: tables
12:07> SELECT TABLE Name, SUM(Score), MAX(Time) AS OF
SYSTEM TIME '12:01' FROM UserScores GROUP BY Name;
-------------------------
| Name | Score | Time |
-------------------------
| Julie | 7 | 12:01 |
-------------------------
69. 67
Time-varying relations: streams
12:00> SELECT STREAM Name, SUM(Score), MAX(Time) FROM UserScores GROUP BY Name;
-------------------------
| Name  | Score | Time  |
-------------------------
| Julie | 7     | 12:01 |
| Frank | 3     | 12:03 |
| Julie | 8     | 12:03 |
| Julie | 12    | 12:07 |
...
12:07> SELECT TABLE Name, SUM(Score), MAX(Time) FROM UserScores GROUP BY Name;
-------------------------
| Name  | Score | Time  |
-------------------------
| Julie | 12    | 12:07 |
| Frank | 3     | 12:03 |
-------------------------
70. 68
How does this relate to streams & tables?
Tables capture a point-in-time snapshot of a time-varying relation.
Streams capture the evolution of a time-varying relation over time.
72. 70
When do you need SQL extensions for streaming?
How is output consumed?
As a table: SQL extensions rarely needed.
As a stream: SQL extensions sometimes needed (good defaults = often not needed).
73. 71
When do you need SQL extensions for streaming?*
Explicit table / stream selection
● SELECT TABLE * from X;
● SELECT STREAM * from X;
Timestamps and windowing
● Event-time columns
● Windowing. E.g.,
SELECT * FROM X GROUP BY SESSION(<COLUMN> INTERVAL '5' MINUTE);
○ Grouping by timestamp
○ Complex multi-row transactions inexpressible in declarative SQL (e.g., session windows)
Sane default table / stream selection
● If all inputs are tables, output is a table
● If any inputs are streams, output is a stream
Simple triggers
● Implicitly defined by characteristics of the sink
● Optionally configured outside of the query
● Per-query, e.g.: SELECT * from X EMIT <WHEN>;
● Focused set of use cases:
○ Repeated updates: ... EMIT AFTER <TIMEDELTA>
○ Completeness: ... EMIT WHEN WATERMARK PAST <COLUMN>
○ Repeated updates + completeness (e.g., early/on-time/late pattern): ... EMIT AFTER <TIMEDELTA> AND WHEN WATERMARK PAST <COLUMN>
* Most of these extensions are theoretical at this point; very few have concrete implementations.
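The "sane default table / stream selection" rule above is simple enough to state as code. The following plain-Java sketch uses hypothetical names (it is not an API of Beam, Calcite, or Flink) and only encodes the two bullets: a query over tables alone yields a table, and any stream input makes the output a stream.

```java
import java.util.List;

// Hypothetical illustration of the default output-kind rule; not a real library API.
class DefaultOutputSketch {
    enum Kind { TABLE, STREAM }

    // If any input is a stream, the output is a stream; otherwise it is a table.
    static Kind defaultOutput(List<Kind> inputs) {
        return inputs.contains(Kind.STREAM) ? Kind.STREAM : Kind.TABLE;
    }

    public static void main(String[] args) {
        System.out.println(defaultOutput(List.of(Kind.TABLE, Kind.TABLE)));  // TABLE
        System.out.println(defaultOutput(List.of(Kind.TABLE, Kind.STREAM))); // STREAM
    }
}
```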