Productizing Structured Streaming Jobs

Burak Yavuz
April 24, 2019 – SAIS 2019 San Francisco

Who am I
● Software Engineer – Databricks
- “We make your streams come true”
● Apache Spark Committer
● MS in Management Science & Engineering -
Stanford University
● BS in Mechanical Engineering - Bogazici
University, Istanbul

Writing code is fun…
… is that all we do?

Image from: https://www.smartsheet.com/sites/default/files/IC-Software-Development-Life-Cycle.jpg

Let’s look at the operational
aspects of data pipelines

Agenda
How to
• Test
• Monitor
• Deploy
• Update
Structured Streaming Jobs

Structured Streaming
stream processing on Spark SQL engine
fast, scalable, fault-tolerant
rich, unified, high level APIs
deal with complex data and complex workloads
rich ecosystem of data sources
integrate with many storage systems

Structured Streaming @ Databricks
1000s of customer streaming apps
in production on Databricks
1000+ trillions of rows processed
in production

Anatomy of a Streaming Query
spark.readStream
.format("kafka")
.option("subscribe", "input")
.load()
.groupBy($"value".cast("string"))
.count()
.writeStream
.format("kafka")
.option("topic", "output")
.trigger(Trigger.ProcessingTime("1 minute"))
.outputMode(OutputMode.Complete())
.option("checkpointLocation", "…")
.start()
Source
• Specify one or more locations
to read data from
• Built in support for
Files/Kafka/Socket,
pluggable.
• Can include multiple sources
of different types using
union()

spark.readStream
.format("kafka")
.load()
.groupBy('value.cast("string") as 'key)
.agg(count("*") as 'value)
.writeStream
.format("kafka")
.start()
Transformation
• Using DataFrames,
Datasets and/or SQL.
• Catalyst figures out how to
execute the transformation
incrementally.
• Internal processing always
exactly-once.

spark.readStream
.format("kafka")
.load()
.writeStream
.format("kafka")
.start()
Sink
• Accepts the output of each
batch.
• When supported, sinks are
transactional and exactly-once
(e.g. Files).
• Use foreach to execute
arbitrary code.

spark.readStream
.format("kafka")
.load()
.writeStream
.format("kafka")
.outputMode("update")
.start()
Output mode – What's output
• Complete – Output the whole answer
every time
• Update – Output changed rows
• Append – Output new rows only
Trigger – When to output
• Specified as a time interval; triggering
on data size may be supported eventually
• No trigger means as fast as possible
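A minimal sketch of how output mode and trigger are set on a query. To stay runnable without a Kafka cluster it swaps in Spark's built-in "rate" source and "console" sink. The object name, the bucket expression, and the 10 second interval are illustrative assumptions, not part of the talk.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count}
import org.apache.spark.sql.streaming.{OutputMode, StreamingQuery, Trigger}

object TriggerAndOutputMode {
  def start(spark: SparkSession): StreamingQuery = {
    // The "rate" source emits (timestamp, value) rows; it stands in for Kafka here.
    val counts = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()
      .groupBy((col("value") % 3).as("bucket"))
      .agg(count("*").as("rows"))

    counts.writeStream
      .format("console")
      // What to output: only the rows whose aggregate changed in this micro-batch.
      .outputMode(OutputMode.Update())
      // When to output: start a micro-batch every 10 seconds.
      .trigger(Trigger.ProcessingTime("10 seconds"))
      .start()
  }
}
```

Leaving out the .trigger(...) call runs micro-batches back to back, which is the "as fast as possible" behavior noted above.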

spark.readStream
.format("kafka")
.load()
.writeStream
.format("kafka")
.start()
Checkpoint
• Tracks the progress of a
query in persistent storage
• Can be used to restart the
query if there is a failure.
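A sketch of a checkpointed query, assuming the built-in "rate" source and a Parquet file sink in place of Kafka. The object name and both directory parameters are hypothetical. Starting the query again with the same checkpoint directory is what lets it pick up from the recorded progress.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQuery

object CheckpointedQuery {
  // checkpointDir and outputDir are caller-supplied paths (hypothetical names).
  def start(spark: SparkSession, checkpointDir: String, outputDir: String): StreamingQuery =
    spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()
      .writeStream
      .format("parquet")
      // Progress is tracked here; reuse the same path to resume after a failure.
      .option("checkpointLocation", checkpointDir)
      .start(outputDir)
}
```

After a stop or a failure, calling CheckpointedQuery.start with the same two paths restarts the query from the checkpoint rather than from scratch.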

Data Pipelines @ Databricks
[Pipeline diagram. Labels: Event Based, Bronze Tables, Silver Tables, Gold Tables, Reporting, Streaming Analytics]

Event Based File Sources
• Launched Structured Streaming connectors:
• s3-sqs on AWS (DBR 3.5)
• abs-aqs on Azure (DBR 5.0)
• As blobs are generated:
• Events are published to SQS/AQS
• Spark reads these events
• Then reads original files from
blob storage system
[Diagram labels: Azure Blob Storage, Event Grid, Queue Storage; AWS S3, AWS SQS]

Properties of Bronze/Silver/Gold
• Bronze tables
• No data processing
• Deduplication + JSON => Parquet conversion
• Data kept around for a couple of weeks so that mistakes can be fixed
• Silver tables
• Tens/Hundreds of tables
• Directly queryable tables
• PII masking/redaction
• Gold tables
• Materialized views of silver tables
• Curated tables by the Data Science team

Why this Architecture?
• Maximize Flexibility
• Maximize Scalability
• Lower Costs

See TD’s talk:
“Designing Structured Streaming
Pipelines—How to Architect Things Right”
April 25 2:40pm – Streaming Track

Testing
spark.readStream
.format("kafka")
.load()
.writeStream
.format("kafka")
.start()
- How do we test this
code?
- Do we need to set up
Kafka?
- How do we verify
result correctness?

Testing
Strategy 1: Don’t care about sources and sinks. Just test your
business logic, using batch DataFrames
Pros:
- Easy to do in
Scala/Python
Cons:
- Not all batch operations
are supported in Streaming
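A minimal sketch of Strategy 1, assuming the business logic is factored into a function from DataFrame to DataFrame. The names BusinessLogic and countByKey are hypothetical. The same function can be applied to a streaming DataFrame in production and to a batch DataFrame in a test.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count}

object BusinessLogic {
  // Hypothetical transformation: count rows per value, mirroring the Kafka example.
  def countByKey(input: DataFrame): DataFrame =
    input
      .groupBy(col("value").cast("string").as("key"))
      .agg(count("*").as("value"))
}

object BusinessLogicBatchTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("batch-logic-test")
      .getOrCreate()
    import spark.implicits._

    // Batch stand-in for the Kafka source: a single binary "value" column.
    val input = Seq("a", "b", "a").map(_.getBytes("UTF-8")).toDF("value")

    val result = BusinessLogic.countByKey(input)
      .as[(String, Long)]
      .collect()
      .toMap

    assert(result == Map("a" -> 2L, "b" -> 1L))
    spark.stop()
  }
}
```

This exercises only the transformation. It does not catch operations that are valid in batch but unsupported in streaming, which is the con listed above.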

Testing
Strategy 2: Leverage the StreamTest test harness available in Apache
Spark
val inputData = MemoryStream[Array[Byte]]
val stream = inputData.toDS().toDF("value")
  .groupBy('value.cast("string") as 'key)
  .agg(count("*") as 'value)
testStream(stream, OutputMode.Update)(
  AddData(inputData, "a".getBytes(), "b".getBytes()),
  CheckAnswer(("a" -> 1), ("b" -> 1))
)

Testing
Strategy 2: Leverage the StreamTest test harness
available in Apache Spark
What each part of the test does:
- MemoryStream: Source is in memory. Schema can be set arbitrarily to mimic real source
- Transformation unchanged.
- testStream: Starts a stream outputting data to a memory sink
- AddData: Add data to the source
- CheckAnswer: Process all data and check result

Testing
Available actions in StreamTest:
- StartStream: Allows you to provide a trigger, checkpoint location, or SQL
configurations for your stream
- AddData: Adds data to your source
- CheckAnswer: Check the current data available in your sink
- CheckLastBatch: Check data that was written to your sink in the last
epoch/micro-batch
- StopStream: Stop your stream to mimic failures/upgrades
- ExpectFailure: Allows you to test failure scenarios on the last batch based
on input data
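A sketch that strings several of these actions together to mimic a restart. It assumes the Spark test jars are on the test classpath, since StreamTest ships in the spark-sql tests artifact and relies on test classes from Spark's other modules. The suite name and test name are made up for illustration.

```scala
import org.apache.spark.sql.execution.streaming.MemoryStream
import org.apache.spark.sql.functions.count
import org.apache.spark.sql.streaming.{OutputMode, StreamTest}

class CountByKeySuite extends StreamTest {
  import testImplicits._

  test("counts survive a restart") {
    val inputData = MemoryStream[String]
    val counts = inputData.toDS()
      .groupBy($"value")
      .agg(count("*"))
      .as[(String, Long)]

    testStream(counts, OutputMode.Update)(
      AddData(inputData, "a", "b"),
      CheckAnswer(("a", 1L), ("b", 1L)),
      StopStream,        // mimic a failure or an upgrade
      StartStream(),     // restart from the harness-managed checkpoint
      AddData(inputData, "a"),
      CheckLastBatch(("a", 2L))  // state was restored, so the count continues from 1
    )
  }
}
```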

Testing
When things go wrong:
[info] - map with recovery *** FAILED *** (8 seconds, 79 milliseconds)
[info] == Results ==
[info] !== Correct Answer - 6 == == Spark Answer - 3 ==
[info] struct<value:int> struct<value:int>
[info] [2] [2]
[info] [3] [3]
[info] [4] [4]
[info] ![5]
[info] ![6]
[info] ![7]

Testing
When things go wrong (cont’d):
[info] == Progress ==
[info] AddData to MemoryStream[value#1]: 1,2,3
[info] StartStream(ProcessingTime(0),org.apache.spark.util.SystemClock@40afc81b,Map(),null)
[info] CheckAnswer: [2],[3],[4]
[info] StopStream
[info] StartStream(ProcessingTime(0),org.apache.spark.util.SystemClock@1074e4ef,Map(),null)
[info] => CheckAnswer: [2],[3],[4],[5],[6],[7]
[info]

Testing
When things go wrong (cont’d):
[info] == Stream ==
[info] Output Mode: Append
[info] Stream state: {MemoryStream[value#1]: 0}
[info] Thread state: alive
[info] Thread stack trace: java.lang.Thread.sleep(Native Method)
[info] org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:236)
[info] org.apache.spark.sql.execution.streaming.MicroBatchExecution$$Lambda$1357/1672418781.apply$mcZ$sp(Unknown Source)
[info] org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
[info] org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:180)
[info] org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:345)
[info] org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:257)
[info]
[info]
[info] == Sink ==
[info] 0: [2] [3] [4]

Testing
How to use StreamTest?
a) Copy the code from the Spark repository to your project (recommended)
- Isolates you from changes in open source that may break your build

Testing
How to use StreamTest?
b) Import the spark-sql test jars
Maven:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.4.1</version>
<scope>test</scope>
<type>test-jar</type>
</dependency>
SBT:
"org.apache.spark" %% "spark-sql" % "2.4.0" % "test" classifier "tests"

Testing
Strategy 2: Leverage the StreamTest test harness available in Apache
Spark
Pros:
- A great test harness for free!
- Quick and cheap way to test
business logic
Cons:
- Only available in Scala

Testing
Strategy 3: Integration testing using Databricks Jobs
1. Have a replica of production in a staging account
2. Use Databricks REST APIs/Airflow/Azure Data Factory to kick off a
single-run job
3. Verify data output, data latency, job duration
Pros:
- Closest option to mirror
production
Cons:
- Hard to set up
- Expensive

Testing
What else to watch out for?
- Table schemas: Changing the schema/logic of one stream upstream
can break cascading jobs
Stay tuned for Spark Summit Europe!
- Dependency hell: The environment of your local machine or Continuous
Integration service may differ from production
Check out Databricks Container Services!

Testing
What else to watch out for?
- Stress Testing: Most times Spark isn’t the bottleneck. In fact,
throwing more money at your Spark clusters can make the problem worse!
a) Don’t forget to tune your Kafka brokers (num.io.threads,
num.network.threads)
b) Most cloud services have rate limits; make sure you stay under them
as much as you can

Testing Best Practices
1. Leverage the StreamTest harness
for unit tests
- Use MemoryStream and
MemorySink to test business logic

2. Maintain a staging environment to
integration test before pushing to
production
- You can use Databricks Jobs and
Databricks Container Services to
ensure you have a replica of your
production environment

3. Don’t forget to test data dependencies: schema changes upstream
can break downstream jobs

4. Perform stress tests in a staging
environment to have a runbook for
production. Not all problems lie in
your Spark cluster.

Monitoring
Get last progress of the
streaming query
- Current input and processing rates
- Current processed offsets
- Current state metrics
streamingQuery.lastProgress
{ ...
"inputRowsPerSecond" : 10024.225210926405,
"processedRowsPerSecond" : 10063.737001006373,
"durationMs" : { ... },
"sources" : [ ... ],
"sink" : { ... }
...
}
Get progress asynchronously
by registering your own
StreamingQueryListener
new StreamingQueryListener {
def onQueryStarted(...)
def onQueryProgress(...)
def onQueryTerminated(...)
}

Monitoring
Leverage the StreamingQueryListener API
Push data to:
- Azure Monitor
- AWS CloudWatch
- Apache Kafka
{
"id" : "be3ff70b-d2e7-428f-ac68-31ee765c7744",
"runId" : "2302c661-ae0f-4a52-969f-c0d62899af06",
"name" : null,
"timestamp" : "2019-04-23T00:32:26.146Z",
"batchId" : 3,
"numInputRows" : 4316,
"inputRowsPerSecond" : 169.45425991362387,
"processedRowsPerSecond" : 158.81660288489846,
"durationMs" : {
"addBatch" : 26364,
"getBatch" : 6,
"getOffset" : 23,
"queryPlanning" : 12,
"triggerExecution" : 27176,
"walCommit" : 365
},
"stateOperators" : [ ...],
"sources" : [ ... ],
"sink" : { "description" : ... }
}
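A sketch of a listener that forwards each progress report as JSON. The publish callback is a hypothetical stand-in for whichever client (Azure Monitor, CloudWatch, a Kafka producer) receives the metrics. The class name and the hand-built JSON for the start and termination events are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{
  QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

// `publish` is a hypothetical callback standing in for a metrics client.
class ProgressForwarder(publish: String => Unit) extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit =
    publish(s"""{"event":"started","id":"${event.id}","runId":"${event.runId}"}""")

  // event.progress.json is the same JSON document shown on the slide above.
  override def onQueryProgress(event: QueryProgressEvent): Unit =
    publish(event.progress.json)

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
    publish(s"""{"event":"terminated","id":"${event.id}","failed":${event.exception.isDefined}}""")
}

object ProgressForwarder {
  // Registers a forwarder that just prints; swap println for a real client.
  def register(spark: SparkSession): Unit =
    spark.streams.addListener(new ProgressForwarder(json => println(json)))
}
```

The listener is registered once per SparkSession and receives events for every query running in it, so the query id in each message is what tells streams apart.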

Monitoring
Even if you are running a map-only job, you can add a watermark
- This allows you to collect event time min, max, average in metrics
You can add current_timestamp() to keep track of ingress timestamps
- udf(() => new java.sql.Timestamp(System.currentTimeMillis)) to get
accurate processing timestamp
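A sketch of the two ideas above applied to one DataFrame. The object name, the output column names, and the "10 minutes" watermark delay are illustrative assumptions. The UDF is marked non-deterministic so the optimizer does not treat it as a constant.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{current_timestamp, udf}

object MonitoringColumns {
  // Evaluated for each row when it is processed, so it reflects actual processing time.
  val processingTime = udf(() => new java.sql.Timestamp(System.currentTimeMillis))
    .asNondeterministic()

  // eventTimeCol must be a timestamp column; the delay value is an arbitrary example.
  def withMonitoring(events: DataFrame, eventTimeCol: String): DataFrame =
    events
      // Declaring a watermark makes event-time statistics appear in the progress metrics.
      .withWatermark(eventTimeCol, "10 minutes")
      // Ingress timestamp, as suggested on the slide.
      .withColumn("ingest_time", current_timestamp())
      .withColumn("processed_at", processingTime())
}
```

With the watermark in place, the query progress should carry an eventTime section with min, max, and avg values that can be charted alongside the input and processing rates.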

Monitoring
Start streams on your tables for monitoring and build streaming
dashboards in Databricks!
• Use display(streaming_df) to get live updating displays in Databricks
• Use foreach/foreachBatch to trigger alerts
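A sketch of alerting with foreachBatch, assuming a metrics stream with a numeric latency_ms column. The column name, the sendAlert hook, and the object name are all hypothetical. The batch function is declared as a typed Scala function value, which sidesteps overload ambiguity between the Scala and Java variants of foreachBatch on some Scala versions.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.StreamingQuery

object LatencyAlerts {
  // Hypothetical alert hook; replace with a pager, chat, or email client.
  def sendAlert(message: String): Unit = println(s"ALERT: $message")

  // Pure check, so the alerting rule can be unit tested on batch DataFrames.
  def slowRows(batch: DataFrame, thresholdMs: Long): Long =
    batch.filter(col("latency_ms") > thresholdMs).count()

  def start(metrics: DataFrame, thresholdMs: Long): StreamingQuery = {
    val checkBatch: (DataFrame, Long) => Unit = (batch, batchId) => {
      val n = slowRows(batch, thresholdMs)
      if (n > 0) sendAlert(s"batch $batchId: $n rows above $thresholdMs ms")
    }
    metrics.writeStream.foreachBatch(checkBatch).start()
  }
}
```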

Deploying
Where to deploy this many (hundreds of) streams?
a) Each stream gets a cluster
Pros:
+ Better isolation
Cons:
- Costly
- More moving parts
b) Multiplex many streams on a single cluster
Pros:
+ Better cluster utilization
+ Potential Delta Cache re-use
Cons:
- Driver becomes a bottleneck
- Determining how many is difficult
- Load balancing streams across clusters also difficult

Deploying
What causes bottlenecks in the Driver?
1. Locks!
- JSON Serialization of offsets in streaming (Jackson)
- Scala compiler (Encoder creation)
- Hadoop Configurations (java.util.Properties)
- Whole Stage Codegen (ClassLoader.loadClass)
2. Garbage Collection

Deploying
How many streams can you run on a single driver?
- Depends on your streaming sources and sinks
Sources, ranked by efficiency:
1. Delta Lake
2. Event Based File Sources
3. Kafka / Azure EventHub / Kinesis
4. Other File Sources (JSON/CSV)
Sinks, ranked by efficiency:
1. Kafka
2. Delta Lake
3. Other File Formats

Deploying
How many streams can you run on a single driver?
- ~80 S3-SQS => Delta Streams at modest data rates
- ~40 Delta => Delta Streams at high data rates
After removing most locks, we got to 200 Delta => Delta streams at
modest data rates, with 40 streams per SparkSession

Updating
spark.readStream
.format("kafka")
.load()
.writeStream
.format("kafka")
.start()
Checkpoint
• Tracks the progress of a
query in persistent storage
• Can be used to restart the
query if there is a failure.

Updating
The Checkpoint:
- The checkpoint location is the unique identity of your stream
- Contains:
a) The id of the stream (json file named metadata)
b) Source offsets (folder named offsets, contains json files)
c) Aggregation state (folder named state, contains binary files)
d) Commit files (folder named commits, contains json files)
e) Source Metadata (folder named sources)
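A small sketch for looking at what a checkpoint actually contains. It assumes the checkpoint sits on a local filesystem; a checkpoint on S3 or another object store would need the corresponding filesystem client instead. The object and method names are made up for illustration.

```scala
import java.nio.file.{Files, Path}
import scala.collection.JavaConverters._

object CheckpointInspector {
  // Returns the names of the top-level entries of a checkpoint directory, sorted.
  def topLevelEntries(checkpointDir: Path): Seq[String] = {
    val entries = Files.list(checkpointDir)
    try entries.iterator().asScala.map(_.getFileName.toString).toList.sorted
    finally entries.close()  // Files.list holds an open directory handle
  }
}
```

Calling CheckpointInspector.topLevelEntries(java.nio.file.Paths.get(dir)) on a running query's checkpoint location is a quick way to match the list above against a real directory before deciding what can be changed safely.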

Updating
Based on files stored in a checkpoint, what can you change?
1. Sinks
2. Input/Output schema (in the absence of stateful operations)
3. Triggers
4. Transformations
5. Spark Versions

Updating
Based on files stored in a checkpoint, what can’t you change?
1. Stateful operations: agg, flatMapGroupsWithState, dropDuplicates,
join
- Schema: key, value
- Parallelism: spark.sql.shuffle.partitions
- Can’t add or remove stateful operators
2. Output Mode (will work, but semantics of stream has changed)
3. Sources

Updating
How to work around limitations?
• Restart stream from scratch
• Use new checkpoint location – avoid
eventual consistency on S3
• Partition source tables by date,
restart stream from a given date
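A sketch of the restart described above, assuming a Parquet source table with a date partition column and a Parquet sink. Every name here (the object, the paths, the date column) is hypothetical. The filter limits which rows the restarted stream processes. How much old data is skipped at the file-listing level depends on the source.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.StreamingQuery
import org.apache.spark.sql.types.StructType

object RestartFromDate {
  def restart(
      spark: SparkSession,
      schema: StructType,        // must include the `date` partition column
      sourcePath: String,
      startDate: String,         // e.g. "2019-04-01"
      sinkPath: String,
      newCheckpoint: String      // a fresh location, never the old checkpoint
  ): StreamingQuery =
    spark.readStream
      .schema(schema)
      .parquet(sourcePath)
      // Only reprocess data from the given date onwards.
      .where(col("date") >= startDate)
      .writeStream
      .format("parquet")
      .option("checkpointLocation", newCheckpoint)
      .start(sinkPath)
}
```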

Operating Pipelines Are Hard
Stay Tuned for:

Thank You
“Do you have any questions for my prepared answers?”
– Henry Kissinger
