Real-time Data Processing Using AWS Lambda

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Tara E. Walker
AWS Technical Evangelist
tarawalk@amazon.com / @taraw
October 24, 2016

Agenda
AWS Services for Real-Time Data Processing
• AWS Lambda
• Amazon Kinesis
Architecture & Workflow for Streaming Data Processing
Streaming Data Processing Demo
Best Practices in Building Data Processing Solutions

AWS Services for Real-Time Data Processing
Amazon Kinesis
AWS Lambda

AWS Lambda: Overview
Lambda function: a piece of code with stateless execution
Triggered by events:
• Direct Sync and Async API calls
• AWS Service integrations
• 3rd party triggers
• And many more …
Makes it easy to:
• Perform data-driven auditing, analysis, and notification
• Build back-end services that perform at scale

AWS Lambda: Serverless Compute in the Cloud
Compute service that runs your code in response to events
Easy to author, deploy, maintain, secure and manage
Allows for focus on business logic
Stateless, event-driven code with native support for Node.js, Java, and Python languages
Compute & Code without managing infrastructure like EC2 instances and Auto Scaling groups
Makes it easy to build back-end services that perform at scale

Benefits of AWS Lambda for building a serverless data processing engine
“Productivity focused compute platform to build powerful, dynamic, modular applications in the cloud”
1. No infrastructure to manage: Focus on business logic, not infrastructure. You upload code; AWS Lambda handles everything else.
2. High performance at any scale; cost-effective and efficient: Pay only for what you use. Lambda automatically matches capacity to your request rate. Purchase compute in 100 ms increments.
3. Bring your own code: Run code in a choice of standard languages. Use threads, processes, files, and shell scripts normally.

Amazon Kinesis: Overview
Managed services for streaming data ingestion and processing
• Amazon Kinesis Firehose: Easily load massive volumes of streaming data into Amazon S3 and Redshift
• Amazon Kinesis Analytics: Easily analyze data streams using standard SQL queries
• Amazon Kinesis Streams: Build your own custom applications that process or analyze streaming data

Amazon Kinesis: Streaming data done the AWS way
Makes it easy to capture, deliver, and process real-time data streams
Pay as you go, no up-front costs
Right services for your specific use cases
Real-time latencies
Easy to provision, deploy, and manage

Benefits of Amazon Kinesis for stream data ingestion and continuous processing
Real-time Ingest
• Highly Scalable
• Durable
• Replay-able Reads
Continuous Processing
• GetShardIterator and GetRecords(ShardIterator)
• Allows checkpointing/replay
• Enables multiple concurrent processing applications
KCL, Firehose, Analytics, Lambda
• Enable data movement into many Stores/Processing Engines
Managed Service
Low end-to-end latency
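
The GetShardIterator and GetRecords(ShardIterator) calls named above can be sketched as a simple read loop. This is a minimal illustration, assuming a boto3-style Kinesis client is passed in. The function name read_shard, its parameters, and the in-memory FakeKinesisClient used to exercise it are hypothetical and not part of the original deck.

```python
def read_shard(client, stream_name, shard_id,
               iterator_type="TRIM_HORIZON", max_calls=10):
    """Read records from one shard, following NextShardIterator between calls."""
    response = client.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType=iterator_type,
    )
    iterator = response["ShardIterator"]
    records = []
    for _ in range(max_calls):
        if iterator is None:
            break  # a closed shard has been read to the end
        batch = client.get_records(ShardIterator=iterator, Limit=100)
        records.extend(batch["Records"])
        iterator = batch.get("NextShardIterator")
    return records


class FakeKinesisClient:
    """Hypothetical in-memory stand-in so the loop can run without AWS access."""

    def __init__(self, records):
        self._records = list(records)

    def get_shard_iterator(self, StreamName, ShardId, ShardIteratorType):
        return {"ShardIterator": "0"}

    def get_records(self, ShardIterator, Limit):
        start = int(ShardIterator)
        chunk = self._records[start:start + Limit]
        end = start + len(chunk)
        # Mimic a closed shard: no next iterator once everything is read.
        next_iterator = str(end) if end < len(self._records) else None
        return {"Records": chunk, "NextShardIterator": next_iterator}
```

Because the iterator is just a position in the shard, a consumer that saves it (or the last sequence number) can resume or replay from that point, which is what makes checkpointing possible.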

Data Processing/Streaming Architecture & Workflow
[Diagram labels: Smart Devices, Click Stream, Log Data]

AWS Lambda and Amazon Kinesis integration
How it Works
Stream-based model (Pull):
▪ Lambda polls the stream and batches available records
▪ Batches are passed to the Lambda function for invocation through function parameters
▪ The Kinesis stream is mapped as an event source in Lambda
Synchronous invocation:
▪ Lambda is invoked synchronously, with the RequestResponse invocation type
▪ The Lambda function is executed at least once
▪ Each shard blocks on in-order, synchronous invocation
Event structure:
▪ The event received by the Lambda function is a collection of records from the Kinesis stream
▪ The customer defines the maximum batch size, not the effective batch size
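
The event structure described above can be illustrated with a small handler. This is a sketch only: it assumes the producers put JSON payloads on the stream, and the sample event, the partition key "device-1", and the payload fields are made up for illustration. Record data in the event arrives base64-encoded under Records[i]["kinesis"]["data"].

```python
import base64
import json


def handler(event, context):
    """Decode each Kinesis record in the batch and return the parsed payloads."""
    payloads = []
    for record in event["Records"]:
        # Kinesis record data is base64-encoded in the Lambda event.
        raw = base64.b64decode(record["kinesis"]["data"])
        # Assumption: producers wrote UTF-8 JSON documents.
        payloads.append(json.loads(raw.decode("utf-8")))
    return payloads


# Hypothetical one-record event in the shape Lambda passes for a Kinesis source.
sample_event = {
    "Records": [
        {
            "eventSource": "aws:kinesis",
            "kinesis": {
                "partitionKey": "device-1",
                "sequenceNumber": "1",
                "data": base64.b64encode(
                    json.dumps({"temp": 21}).encode("utf-8")
                ).decode("ascii"),
            },
        }
    ]
}
```

Calling handler(sample_event, None) locally is a quick way to check decoding logic before attaching the function to a stream.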

Streaming Architecture Workflow: Lambda + Kinesis
Kinesis: Capture the stream. Lambda: Process the stream.

Data Input               Action         Data Output
IT application activity  Audit          SNS
Metering records         Condense       Redshift
Change logs              Backup         S3
Financial data           Store          RDS
Transaction orders       Process        SQS
Server health metrics    Monitor        EC2
User clickstream         Analyze        EMR
IoT device data          Respond        Backend endpoint
Custom data              Custom action  Custom application

Common Architecture: Lambda + Kinesis
Real-Time Data Processing
[Diagram: Amazon Kinesis feeds AWS Lambda 1, which writes to Amazon CloudWatch and Amazon DynamoDB, and AWS Lambda 2, which writes to Amazon S3]
1. Real-time event data is sent to Amazon Kinesis, which allows multiple AWS Lambda functions to process the same events.
2. In AWS Lambda, Function 1 processes and aggregates data from incoming events, then stores the result data in Amazon DynamoDB.
3. Lambda Function 1 also sends values to Amazon CloudWatch for simple monitoring of metrics.
4. In AWS Lambda, Function 2 manipulates data from incoming events and stores the results in Amazon S3.
https://s3.amazonaws.com/awslambda-reference-architectures/stream-processing/lambda-refarch-streamprocessing.pdf

Common Architecture: Lambda + Kinesis
Data Processing for Data Storage/Analysis
• Use AWS Lambda to process and “fan out” to other AWS services, e.g. storage, database, and BI/analytics
• An Amazon Kinesis stream can continuously capture and store terabytes of data per hour from hundreds of thousands of sources
• Grant AWS Lambda permissions for the relevant stream actions via IAM (execution role) during function creation

Demo: Real-time processing of Amazon Kinesis data streams with AWS Lambda

Data Processing:
Best Practices & Tips

Best Practices
Creating a Kinesis stream
Streams
▪ Made up of Shards
▪ Each Shard ingests data up to 1 MB/sec
▪ Each Shard emits data up to 2 MB/sec
▪ Determine an initial size (number of shards)
   • Leverage “Help me decide how many shards I need” in the Console
   • Use the formula: Number Of Shards = max(incoming_write_bandwidth_in_KB / 1000, outgoing_read_bandwidth_in_KB / 2000)
Data
▪ All data is stored for 24 hours by default; replay data within that 24-hour window
▪ A Partition Key is supplied by the producer and used to distribute the PUTs across Shards
▪ A unique Sequence # is returned to the Producer upon a successful PUT call
▪ Make sure partition key distribution is even to optimize parallel throughput
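
The shard sizing formula above can be worked through in a few lines. This is a sketch: the function name and argument names are illustrative, and rounding up to a whole shard with a minimum of one shard is an assumption added here, since a stream cannot have a fractional number of shards.

```python
import math


def estimate_shards(incoming_write_kb_per_sec, outgoing_read_kb_per_sec):
    """Apply the sizing formula from the slide, rounded up to whole shards."""
    write_shards = incoming_write_kb_per_sec / 1000  # 1 MB/sec ingest per shard
    read_shards = outgoing_read_kb_per_sec / 2000    # 2 MB/sec emit per shard
    # Assumption: round up, and never return fewer than one shard.
    return max(1, math.ceil(max(write_shards, read_shards)))
```

For example, 3,500 KB/sec of writes and 5,000 KB/sec of reads gives max(3.5, 2.5) = 3.5, so estimate_shards(3500, 5000) suggests 4 shards as a starting point.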

Best Practices
Creating Lambda functions
Code:
▪ Write your Lambda function code in a stateless style
▪ Instantiate AWS clients & database clients outside the scope of the handler to take advantage of connection re-use
Memory:
▪ CPU and disk are proportional to the memory configured
▪ Increasing memory makes your code execute faster (if CPU bound)
▪ Increasing memory allows larger records to be processed
Timeout:
▪ Increasing the timeout allows longer-running functions, but means a longer wait when errors occur
Retries:
▪ For Kinesis, Lambda retries until the data expires (default 24 hours)
Permission model:
▪ The execution role defined for Lambda must have permission to access the stream
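
The advice to create clients outside the handler can be sketched as follows. StorageClient is a hypothetical stand-in for an AWS SDK or database client (it is not a real library class); the point is only where the object is constructed.

```python
class StorageClient:
    """Hypothetical stand-in for an AWS SDK or database client."""

    instances_created = 0

    def __init__(self):
        # In a real client this is where connections would be opened.
        StorageClient.instances_created += 1
        self.saved = []

    def save(self, item):
        self.saved.append(item)


# Created once at module load, outside the handler, so that invocations
# served by the same warm container reuse it instead of reconnecting.
client = StorageClient()


def handler(event, context):
    # The handler itself keeps no state of its own between invocations.
    for record in event["Records"]:
        client.save(record)
    return len(event["Records"])
```

Calling the handler several times in the same process constructs the client only once, which is the connection re-use the slide refers to.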

Best Practices
Configuring Lambda with Kinesis as an event source
Batch size:
▪ Maximum number of records that Lambda will send to one invocation
▪ Not equivalent to the effective batch size
▪ Effective batch size, evaluated every 250 ms, is MIN(records available, batch size, 6 MB)
▪ Increasing the batch size allows fewer Lambda function invocations, with more data processed per invocation
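
The MIN(records available, batch size, 6MB) rule above can be illustrated with a small helper. This is a sketch of the arithmetic only, not of how the service is implemented; the function name is illustrative, and treating 6 MB as 6 * 1024 * 1024 bytes is an assumption.

```python
MAX_PAYLOAD_BYTES = 6 * 1024 * 1024  # assumption: the 6 MB cap expressed in bytes


def effective_batch(record_sizes, batch_size, max_bytes=MAX_PAYLOAD_BYTES):
    """Return how many of the available records fit into one invocation."""
    count = 0
    total = 0
    # Never take more than the configured batch size.
    for size in record_sizes[:batch_size]:
        if total + size > max_bytes:
            break  # the payload cap is reached before the batch size
        total += size
        count += 1
    return count
```

With 50 small records available and a batch size of 100, the effective batch is 50; with ten 1 MiB records, the payload cap limits it to 6.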

Best Practices
Configuring Lambda with Kinesis as an event source
Starting Position:
▪ The position in the stream where Lambda starts reading
▪ Set to “Trim Horizon” to read from the start of the stream (all data)
▪ Set to “Latest” to read only the most recent data (records added after the function is attached)
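
The batch size and starting position settings can be gathered into the parameters of Lambda's CreateEventSourceMapping call. This is a minimal sketch: the helper name and the validation rules are illustrative, and it only builds the arguments rather than calling AWS.

```python
VALID_STARTING_POSITIONS = {"TRIM_HORIZON", "LATEST"}


def build_event_source_mapping(stream_arn, function_name,
                               starting_position, batch_size=100):
    """Build keyword arguments for a CreateEventSourceMapping request."""
    if starting_position not in VALID_STARTING_POSITIONS:
        raise ValueError("starting_position must be TRIM_HORIZON or LATEST")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return {
        "EventSourceArn": stream_arn,
        "FunctionName": function_name,
        "StartingPosition": starting_position,
        "BatchSize": batch_size,
    }
```

Assuming boto3 is installed and credentials are configured, the result could be passed as boto3.client("lambda").create_event_source_mapping(**params), with your own stream ARN and function name.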

Best Practices
Attaching a Lambda function to a Kinesis Stream
▪ One concurrent Lambda function invocation per Kinesis shard
▪ Increasing the # of shards with even distribution allows increased concurrency
▪ Lambda blocks on ordered processing for each individual shard
▪ This makes the duration of the Lambda function directly impact throughput
▪ Batch size may impact duration if the Lambda function takes longer to process more records
[Diagram: Source → Kinesis shards → Lambda functions → Destination 1, Destination 2. Scale Kinesis by adding shards; Lambda will scale automatically. Lambda polls a batch, then waits for the response.]

Best Practices
Tuning throughput
▪ Maximum theoretical throughput:
  # shards * 2 MB / Lambda function duration (s)
▪ Effective theoretical throughput:
  # shards * batch size (MB) / Lambda function duration (s)
▪ If the put/ingestion rate is greater than the theoretical throughput, your processing is at risk of falling behind
[Diagram: Source → Kinesis shards → Lambda functions → Destination 1, Destination 2. Scale Kinesis by splitting or merging shards; Lambda will scale automatically. Lambda polls a batch, then waits for the response.]
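
The two throughput formulas on this slide can be worked through directly. The sketch below is a straight transcription of them into Python; the function names and the example numbers are illustrative, and results are in MB per second.

```python
def max_theoretical_throughput(shards, duration_s):
    """# shards * 2 MB / Lambda function duration (s)."""
    return shards * 2.0 / duration_s


def effective_theoretical_throughput(shards, batch_mb, duration_s):
    """# shards * batch size (MB) / Lambda function duration (s)."""
    return shards * batch_mb / duration_s


def is_falling_behind(ingest_mb_per_s, shards, batch_mb, duration_s):
    """True when the put rate exceeds what the functions can drain."""
    drained = effective_theoretical_throughput(shards, batch_mb, duration_s)
    return ingest_mb_per_s > drained
```

For instance, with 4 shards, 0.5 MB batches, and a 2-second function, the effective throughput is 4 * 0.5 / 2 = 1 MB/sec, so an ingestion rate of 1.5 MB/sec would put processing at risk of falling behind.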

Best Practices
Tuning throughput
▪ Lambda retries execution failures until the record expires
▪ Retries use exponential backoff, up to 60 s
▪ Throttles and errors increase duration and directly reduce throughput
▪ Effective theoretical throughput:
  (# shards * batch size (MB)) / (function duration (s) * retries until expiry)
[Diagram: Source → Kinesis shards → Lambda functions → Destination 1, Destination 2. Scale Kinesis by splitting or merging shards; Lambda will scale automatically. Lambda polls a batch, receives an error, retries and receives another error, then receives success.]
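
The effect of retries on throughput can be seen by plugging numbers into the formula on this slide. This sketch reads "retries until expiry" as the number of attempts a batch needs before it succeeds, which is an interpretation rather than something the slide states; the function name is illustrative.

```python
def throughput_with_retries(shards, batch_mb, duration_s, attempts):
    """(# shards * batch size (MB)) / (function duration (s) * attempts)."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return (shards * batch_mb) / (duration_s * attempts)
```

With 4 shards, 0.5 MB batches, and a 2-second function, one attempt per batch gives 1 MB/sec, while four attempts per batch cuts that to 0.25 MB/sec, before counting any backoff delay between attempts.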

Best Practices
Monitoring Kinesis Streams with Amazon CloudWatch Metrics
• GetRecords (effective throughput): bytes, latency, records, etc.
• PutRecord: bytes, latency, records, etc.
• GetRecords.IteratorAgeMilliseconds: how old the last processed records were. If high, processing is falling behind. If close to 24 hours, records are close to being dropped.
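
The guidance on GetRecords.IteratorAgeMilliseconds can be turned into a simple check. This is a sketch: the function name and the warning and critical thresholds (25% and 90% of the retention window) are hypothetical values chosen for illustration, and the 24-hour window is the default retention mentioned in the slides.

```python
RETENTION_MS = 24 * 60 * 60 * 1000  # 24-hour retention window, in milliseconds


def iterator_age_status(age_ms, warn_fraction=0.25, critical_fraction=0.9):
    """Classify an IteratorAgeMilliseconds reading against the retention window."""
    if age_ms >= critical_fraction * RETENTION_MS:
        return "critical"  # records are close to being dropped
    if age_ms >= warn_fraction * RETENTION_MS:
        return "warning"   # processing is falling behind
    return "ok"
```

A reading of 8 hours would be classified as "warning" under these thresholds, and a reading of 23 hours as "critical". In practice the same comparison would usually be expressed as a CloudWatch alarm on the metric.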

Best Practices
Monitoring Lambda functions
• Monitoring: available in Amazon CloudWatch Metrics
   ▪ Invocation count
   ▪ Duration
   ▪ Error count
   ▪ Throttle count
• Debugging: available in Amazon CloudWatch Logs
   ▪ All Metrics
   ▪ Custom logs
   ▪ RAM consumed
   ▪ Search for log events

Best Practices
Create a separate Lambda function for each task, and associate them all with the same Kinesis stream
[Diagram: one function logs to CloudWatch Logs; another pushes to SNS]

Get Started: Data Processing with AWS
Next Steps
1. Create your first Kinesis stream. Configure hundreds of thousands of data producers to put data into an Amazon Kinesis stream, for example data from social media feeds.
2. Create and test your first Lambda function. Use any third-party library, even native ones. First 1M requests each month are on us!
3. Read the Developer Guide, the AWS Lambda and Kinesis Tutorial, and resources on GitHub at AWS Labs
• http://docs.aws.amazon.com/lambda/latest/dg/with-kinesis.html
• https://github.com/awslabs/lambda-refarch-streamprocessing
• https://github.com/awslabs/lambda-streams-to-firehose

Tara E. Walker
AWS Technical Evangelist
@taraw
