ABD217_From Batch to Streaming

© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS re:INVENT
From Batch to Streaming:
How Amazon Flex Uses Real-time Analytics to Deliver Packages on Time
November 28, 2017

Agenda
• Real-time streaming data overview
• Streaming data services
• Benefits of streaming analytics
• Batch to streaming best practices
• How Amazon Flex moved from batch to streaming

What is batch processing?
"Execution of a series of jobs in a program on a computer without manual intervention" (Wikipedia)
• Data is collected over a period of time
• Process and analyze on a schedule
• Combine several processes to obtain final result

Most data is produced continuously
• Mobile apps
• Web clickstream
• Application logs
• Metering records
• IoT sensors
• Smart buildings

The diminishing value of data
Recent data is highly valuable
• If you act on it in time
• Perishable insights (M. Gualtieri, Forrester)
Old + recent data is more valuable
• If you have the means to combine them

Processing real-time, streaming data
What are the key requirements?
• Durable
• Continuous
• Fast
• Correct
• Reactive
• Reliable
Pipeline stages: Collect, Transform, Analyze, React, Persist

Amazon Kinesis makes it easy to work with real-time streaming data
Kinesis Streams
• For technical developers
• Collect and stream data for ordered, replayable, real-time processing
Kinesis Firehose
• For all developers, data scientists
• Easily load massive volumes of streaming data into Amazon S3, Redshift, Elasticsearch
Kinesis Analytics
• For all developers, data scientists
• Easily analyze data streams using standard SQL queries
• Compute analytics in real time

Amazon Kinesis Streams
• Reliably ingest and durably store streaming data at low cost
• Build custom real-time applications to process streaming data
• Use your stream-processing framework of choice

Amazon Kinesis Firehose
• Reliably ingest and deliver batched, compressed, and encrypted data to S3, Redshift, and Elasticsearch
• Point and click setup with zero administration and seamless elasticity
• Managed stream-processing consumer

Amazon Kinesis Analytics
• Interact with streaming data in real time using SQL
• Build fully managed and elastic stream processing applications that process data for real-time visualizations and alarms

Benefits of streaming analysis
Immediate results
• Real-time aggregations
• Filtering
• Anomaly detection
Reduced complexity
• Fewer scheduled jobs to manage
• Kinesis is a fully managed solution
Scalable
• Enables parallel processing
• Horizontally scales, based on your ingest rate

Batch to streaming best practices
Migrate incrementally
• Don’t boil the ocean
• Begin by streaming data in parallel to existing batch processes
• Persist streaming data into durable storage, like Amazon S3
• Add in streaming analysis results to replace batch analysis
[Diagram: data producer, application databases, ETL, data warehouse; alongside, streaming data through Amazon Kinesis into Amazon S3]
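One way to picture "streaming in parallel to existing batch processes" is a producer that keeps feeding the batch path and also publishes each event to a stream. The sketch below is only an illustration under assumptions: `DualWriter`, `batch_sink`, the event fields, and the stream name `telemetry-stream` are hypothetical. The `put_record(StreamName=..., Data=..., PartitionKey=...)` call shape follows the boto3 Kinesis client, but a local stand-in is used here so the example runs without AWS credentials.

```python
import json
from typing import Any, Callable, Dict, List, Tuple


class DualWriter:
    """Send every event to the existing batch path and, in parallel, to a stream."""

    def __init__(self, batch_sink: Callable[[Dict[str, Any]], None],
                 stream_client: Any, stream_name: str) -> None:
        self.batch_sink = batch_sink        # existing batch path, left unchanged
        self.stream_client = stream_client  # e.g. boto3.client("kinesis")
        self.stream_name = stream_name      # hypothetical stream name

    def publish(self, event: Dict[str, Any], partition_key: str) -> None:
        # 1. Keep feeding the batch pipeline so nothing downstream breaks.
        self.batch_sink(event)
        # 2. Also publish to the stream; a streaming failure must not
        #    block the batch path while the migration is in progress.
        try:
            self.stream_client.put_record(
                StreamName=self.stream_name,
                Data=json.dumps(event).encode("utf-8"),
                PartitionKey=partition_key,
            )
        except Exception as exc:  # broad on purpose: this is a sketch
            print(f"stream publish failed, batch path unaffected: {exc}")


class FakeStreamClient:
    """Local stand-in for a Kinesis client so the sketch runs offline."""

    def __init__(self) -> None:
        self.records: List[Tuple[str, bytes, str]] = []

    def put_record(self, StreamName: str, Data: bytes, PartitionKey: str) -> None:
        self.records.append((StreamName, Data, PartitionKey))


batch_rows: List[Dict[str, Any]] = []
fake_stream = FakeStreamClient()
writer = DualWriter(batch_rows.append, fake_stream, "telemetry-stream")
writer.publish({"driver": "d-1", "delivered": 3}, partition_key="d-1")
```

Because the batch path is untouched, the streaming side can be validated against batch results before any batch job is retired.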

Perform ITL rather than ETL
• ITL: Ingest-Transform-Load
• ETL: Extract-Transform-Load
• Transform data in near-real time rather than in a scheduled job
• Enrich data in near-real time
• Persist transformed and/or enriched data
[Diagram: data producer sends raw streaming data to Amazon Kinesis Firehose; an AWS Lambda function transforms the raw data using enrichment source data; transformed and/or enriched data is stored in Amazon S3]
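The transform-and-enrich step could look roughly like the Lambda handler below. It follows the Firehose data-transformation record contract (each record carries `recordId` and base64 `data`, and the response returns `recordId`, `result`, and `data`). Everything else is an assumption for illustration: the `stn` and `packages` fields, the `STATION_CITY` lookup table, and the choice of JSON payloads are not from the talk.

```python
import base64
import json

# Hypothetical enrichment source: maps a station code to a city name.
STATION_CITY = {"SEA1": "Seattle", "PDX2": "Portland"}


def lambda_handler(event, context):
    """Transform and enrich each Firehose record before it is delivered."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            # Transform: rename and normalize a field (hypothetical field names).
            payload["station"] = payload.pop("stn").upper()
            # Enrich: add a value looked up from the enrichment source.
            payload["city"] = STATION_CITY.get(payload["station"], "unknown")
            data = base64.b64encode((json.dumps(payload) + "\n").encode("utf-8"))
            output.append({"recordId": record["recordId"], "result": "Ok",
                           "data": data.decode("utf-8")})
        except (ValueError, KeyError, TypeError, AttributeError):
            # Return unparseable records untouched and flag them as failed.
            output.append({"recordId": record["recordId"],
                           "result": "ProcessingFailed", "data": record["data"]})
    return {"records": output}


# Local usage with a synthetic event (no AWS needed).
sample = {"records": [
    {"recordId": "1",
     "data": base64.b64encode(b'{"stn": "sea1", "packages": 4}').decode("utf-8")},
    {"recordId": "2", "data": base64.b64encode(b"not json").decode("utf-8")},
]}
result = lambda_handler(sample, None)
```

Appending a newline to each record keeps the objects that land in S3 line-delimited, which tends to make later querying simpler.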

Aggregate upon arrival
• Continuously write raw data to persistent data store for archival and other analysis
• Aggregate in real time when window size < 1 hour
• Write aggregated data to persistent data store for immediate value
[Diagram: data producer sends raw streaming data to Amazon Kinesis Firehose, which writes raw data to Amazon S3; Amazon Kinesis Analytics aggregates the stream and the aggregate results are written to Amazon S3 as aggregated data]
Batch to streaming example

Brandon Smith
• Senior software engineer
• Worked at Amazon for 12 years in Kindle, AWS, and now Last Mile Delivery
• Currently working on Amazon Flex

• Amazon delivery app (Android/iOS)
• Crowd-sourced model launched in 30+ U.S. cities
• Used by Amazon Logistics worldwide

• Deliveries for Amazon.com, Prime Now, Amazon Fresh, restaurants, grocery stores
• Millions of packages per year

The problem
• Collecting, processing, and storing telemetry data
• Telemetry data = remote measurements
• Includes metrics, crashes, logs, sensor data, clickstream data, etc.

The goal
• Understand what’s happening in the field
• Analyze all the data and make performance optimizations
• Focus our time on improving the app and the delivery flow

Use cases

Use case 1: Alarming
• We want to know within minutes if there are problems
• Example: If the delivery count drops below our expected/historical value, we want to alarm
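The alarm condition in the example can be sketched as a simple comparison against a baseline. The function name `should_alarm` and the 20% tolerance are assumptions for illustration; the talk does not say how the expected value or the threshold are chosen.

```python
def should_alarm(observed: int, expected: float, tolerance: float = 0.2) -> bool:
    """Return True when the observed count is more than `tolerance` below expected."""
    if expected <= 0:
        return False  # no baseline to compare against
    return observed < expected * (1.0 - tolerance)


# Hypothetical numbers: 100 deliveries expected in this window.
quiet_window = should_alarm(observed=70, expected=100)
normal_window = should_alarm(observed=90, expected=100)
```

In a streaming setup, a check like this would run on each freshly aggregated window rather than once per day.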

Use case 2: Troubleshooting
• Logs and crashes published to Amazon CloudWatch Logs in near-real time
• Can filter and search to troubleshoot issues

Use case 3: Dashboards
• We can write SQL, generate reports, and create visualizations
• But we really want real-time dashboards instead of daily reports
[Figure: daily reports compared with real-time dashboards]

Use case 4: Releases
• Deploying new app versions and monitoring adoption in real time
• Release new code smoothly and with confidence
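Adoption monitoring can be thought of as the share of recent events reported by each app version. A minimal sketch, assuming each telemetry event carries a version string; the function name and sample versions are hypothetical:

```python
from collections import Counter
from typing import Dict, Iterable


def adoption_share(versions: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of events seen from each app version."""
    counts = Counter(versions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {version: count / total for version, count in counts.items()}


# Versions reported by the most recent events (made-up sample).
share = adoption_share(["2.0", "2.0", "1.9", "2.0"])
```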

Use case 5: Sharing data
• Consumers get notifications of new data in real time
• Consumers can join their data with other data in the data lake
[Diagram: S3 bucket, data lake]

Use case 6: Deeper analytics
• Look at the stream of data and the historical data
• Build ML models, create predictions, detect anomalies

How did we build it?

Getting from batch to streaming
• To solve our use cases, we had to incrementally improve our system
• We evolved from a batch-based system to a stream-based system
• Let’s walk through the iterations

Iteration 1: Use existing systems
• Collect metrics and send to an existing metrics service
• ETL jobs to load data into a big Oracle Data Warehouse
[Diagram: App, data collection, existing metrics service, ETL, DW]

Iteration 1: Use existing systems
1. Batch process with 24-hour delay
2. Fixed, inflexible DB schema
3. Analysis difficult and slow via SQL

Iteration 2: Use AWS
• Collect metrics in the app using the Amazon Mobile Analytics SDK, which automatically loads data into Redshift
[Diagram: App, data collection, CloudFormation, ETL system]

Iteration 2: Use AWS
1. Batch process with a 2-hour delay (down from 24 hours)
2. Fixed, inflexible DB schema

Iteration 3: Automated DB schema
• Add shared configuration that is used in the app and automatically updates the Redshift schema
[Diagram: App, schema config, data collection, CloudFormation, ETL system]
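The talk does not show how the shared configuration drives schema updates. One plausible shape, sketched under assumptions, is to diff the configured columns against the columns the table already has and emit additive `ALTER TABLE ... ADD COLUMN` statements. The table name, column names, and types below are hypothetical, and the sketch only generates SQL text; it does not connect to Redshift.

```python
from typing import Dict, List


def schema_migration_sql(table: str, current: Dict[str, str],
                         desired: Dict[str, str]) -> List[str]:
    """Return ALTER TABLE statements for columns in `desired` but not in `current`."""
    statements = []
    for column, column_type in desired.items():
        if column not in current:
            # Additive only: existing columns are never dropped or retyped.
            statements.append(
                f"ALTER TABLE {table} ADD COLUMN {column} {column_type};"
            )
    return statements


# Columns the table has today versus columns in the shared config (both made up).
current_columns = {"event_time": "TIMESTAMP", "event_type": "VARCHAR(64)"}
config_columns = {"event_time": "TIMESTAMP", "event_type": "VARCHAR(64)",
                  "app_version": "VARCHAR(32)"}
migration = schema_migration_sql("telemetry", current_columns, config_columns)
```

Identifiers are interpolated directly, so a generator like this should only ever read from a trusted, version-controlled config.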

Iteration 3: Automated DB schema
1. Batch process with a 2-hour delay (down from 24 hours)
2. Auto-updating DB schema (previously fixed and inflexible)

Iteration 4: Use Streams
• Introduce a Kinesis stream and Kinesis Firehose to publish to Redshift
• Partition data by date to simplify data retention policies
[Diagram: App, data collection via Pinpoint, schema config]
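Partitioning by date simplifies retention because a whole day's partition can be removed at once instead of deleting individual rows. The slide does not say where the partitioning happens or how partitions are named, so the daily `telemetry_YYYY_MM_DD` naming and the 7-day retention below are purely illustrative assumptions.

```python
from datetime import date, datetime, timedelta, timezone
from typing import Iterable, List


def partition_for(epoch_seconds: float, prefix: str = "telemetry") -> str:
    """Name the daily partition (UTC) that an event belongs to."""
    day = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).date()
    return f"{prefix}_{day:%Y_%m_%d}"


def expired_partitions(partitions: Iterable[str], today: date,
                       retention_days: int, prefix: str = "telemetry") -> List[str]:
    """Return partitions older than the retention window; each can be dropped whole."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in partitions:
        day = datetime.strptime(name, f"{prefix}_%Y_%m_%d").date()
        if day < cutoff:
            expired.append(name)
    return expired


partition = partition_for(1511827200)  # 2017-11-28 00:00:00 UTC
to_drop = expired_partitions(
    ["telemetry_2017_11_01", "telemetry_2017_11_27"],
    today=date(2017, 11, 28),
    retention_days=7,
)
```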

Iteration 4: Use Streams
1. Streaming process with a delay of a couple of minutes (previously a batch process with a 24-hour, then a 2-hour, delay)
2. Auto-updating DB schema (previously fixed and inflexible)

Iteration 5: Generic message types
• Use generic message types
• Publish the data to S3, Redshift, and Elasticsearch
[Diagram: App, Elasticsearch]
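A generic message type can be pictured as a small fixed envelope around a free-form payload, so new kinds of data need no schema change and the same message can be delivered to several stores. The sketch below is an assumption-laden illustration: the `Message` fields are hypothetical, JSON is used for readability (the talk's diagram mentions ProtoBuf), and the three sinks are plain Python lists standing in for S3, Redshift, and Elasticsearch writers.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Message:
    """Generic envelope: fixed routing fields plus a free-form payload."""
    message_type: str
    payload: Dict[str, Any]
    timestamp: float = field(default_factory=time.time)

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


def fan_out(message: Message, sinks: List[Callable[[str], None]]) -> None:
    """Deliver the same serialized message to every registered sink."""
    encoded = message.to_json()
    for sink in sinks:
        sink(encoded)


# Stand-ins for the S3, Redshift, and Elasticsearch writers.
s3_objects: List[str] = []
redshift_rows: List[str] = []
search_docs: List[str] = []

msg = Message("crash", {"app_version": "2.0", "screen": "delivery"},
              timestamp=1511827200.0)
fan_out(msg, [s3_objects.append, redshift_rows.append, search_docs.append])
```

Consumers that care about a particular `message_type` can parse only that payload, which is what keeps the envelope stable as new message types appear.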

Iteration 5
[Diagram: App, data collection, ProtoBuf, consumer Lambdas, consumer Redshifts, Elasticsearch, SQL reports, dashboards]

Iteration 5: Generic message types
1. Streaming process with a delay of a few seconds (previously a batch process with a 24-hour, then a 2-hour, delay)
2. Auto-updating DB schema and generic message types (previously a fixed, inflexible schema)
3. Analysis is flexible by processing the message payload (previously difficult and slow via SQL)

Data flow
[Diagram: App, consumer Lambdas, consumer Redshifts, Elasticsearch, SQL reports, dashboards]

Future improvements
Some ideas to make the system even better:
1. Use Kinesis Analytics to query the real-time data stream
2. Use Amazon Athena to query data directly from S3
3. Use Amazon AI services to do deeper data analysis

Summary
Did we solve our use cases?
1. Real-time metrics and alarming
2. Real-time dashboards
3. Real-time logs and crash troubleshooting
4. Monitoring new releases
5. Sharing data with other teams
6. Deeper analytics

Benefits of Streaming
1. Agility: real-time data means your business can react more quickly
2. Flexibility: generic message types give you flexible schemas, so your system can handle multiple data types and future use cases
3. Shareability: streams allow you to multiplex and share your data easily with your consumers
4. Extensibility: processing streams of data allows you to write it to multiple data storage systems, which enables a variety of analytics tools

Thank you!
