AWS Summit Singapore - Architecting a Serverless Data Lake on AWS

© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Unni Pillai
Specialist Solutions Architect, Amazon Web Services
Daniel Muller
Head of Cloud Infrastructure, Spuul
Architecting a Serverless Data Lake on AWS

What is a Data Lake?
A data lake is an architectural approach that allows you to store massive amounts of data in a central location, so that it is readily available to be categorized, processed, analyzed, and consumed by diverse groups of users within an organization.
[Diagram labels]
Sources: OLTP, ERP, CRM, LOB, Devices, Web, Sensors, Social
Stores: Data Warehouse, Data Lake, Catalog
Uses: Business Intelligence, Machine Learning, DW Queries, Big data processing, Interactive, Real-time

Challenges faced by data teams
Exponential growth in data: Transactions, ERP, Sensor Data, Billing, Web Logs, Social, Infrastructure Logs
Diversified consumers: Data Scientists, Business Analysts, External Consumers, Applications
Multiple access mechanisms: API Access, BI Tools, Notebooks

Characteristics of a data lake
Future Proof
Flexible Access
Dive in Anywhere
Collect Anything

Let’s take an example
Device Data
Record-level data
Design Outcomes
• Serverless
• Ingest & store data in real-time
• Discover and catalog data stored in the lake
• Enable batch & real-time processing
• Consume raw & processed data
• Scalable, Highly Available, Pay for what you use

Demo - Architecture
Stages: Data Source, Ingest & Store, Catalog & Transform, Analyze & Consume
Components: Device 1 through Device 100

Multiple data lake ingestion methods
AWS Snowball and AWS Snowmobile
• PB-scale migration
AWS Storage Gateway
• Migrate legacy files
Native/ISV Connectors
• Ecosystem integration
Amazon S3 Transfer Acceleration
• Long-distance data transfer
AWS Direct Connect
• On-premises integration
Amazon Kinesis Firehose
• Ingest device streams
• Transform and store on Amazon S3
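
As a rough sketch of the "ingest device streams" path, the snippet below serializes device readings as newline-delimited JSON and groups them into batches sized for the Firehose PutRecordBatch API. The readings, field names, and stream name are hypothetical, and the 500-record batch limit is stated here as an assumption to verify against current service quotas. The actual AWS call is shown only in a comment so the sketch runs without credentials.

```python
import json
from typing import Dict, Iterable, Iterator, List

# Assumption: PutRecordBatch accepts at most 500 records per call.
MAX_RECORDS_PER_BATCH = 500


def to_firehose_records(readings: Iterable[dict]) -> List[Dict[str, bytes]]:
    """Serialize each reading as one newline-terminated JSON record."""
    return [
        {"Data": (json.dumps(reading, sort_keys=True) + "\n").encode("utf-8")}
        for reading in readings
    ]


def batches(
    records: List[Dict[str, bytes]], size: int = MAX_RECORDS_PER_BATCH
) -> Iterator[List[Dict[str, bytes]]]:
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


# Hypothetical data: 100 devices, 12 readings each (1200 records in total).
readings = [
    {"device_id": f"device-{device}", "seq": seq, "temperature_c": 20 + device % 5}
    for device in range(1, 101)
    for seq in range(12)
]
record_batches = list(batches(to_firehose_records(readings)))

# With boto3 installed and credentials configured, each batch could be sent as:
#   boto3.client("firehose").put_record_batch(
#       DeliveryStreamName="<your-stream>", Records=batch)
```

Terminating each record with a newline keeps the objects that Firehose writes to S3 readable as one JSON document per line.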

Amazon Kinesis - real-time analytics
Easily collect, process, and analyze video and data streams in real time
• Kinesis Video Streams: Capture, process, and store video streams for analytics
• Kinesis Data Streams: Build custom applications that analyze data streams
• Kinesis Data Firehose: Load data streams into AWS data stores
• Kinesis Data Analytics: Analyze data streams with SQL

Serverless data delivery with Kinesis Firehose
Data Producers
Firehose
AWS Lambda: Inline Transform
Destinations: Amazon S3, Amazon Redshift, Amazon Elasticsearch
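
The inline transform step could look roughly like the Lambda handler below. It follows the Firehose data-transformation record contract (each input record carries a `recordId` and base64-encoded `data`; each output record returns the same `recordId`, a `result`, and re-encoded `data`). The payload fields and the Fahrenheit enrichment are hypothetical and are not taken from the demo.

```python
import base64
import json


def handler(event, context):
    """Decode each Firehose record, enrich it, and return it re-encoded."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            # Hypothetical enrichment: add a Fahrenheit value next to Celsius.
            payload["temperature_f"] = round(payload["temperature_c"] * 9 / 5 + 32, 1)
            encoded = base64.b64encode(
                (json.dumps(payload) + "\n").encode("utf-8")
            ).decode("utf-8")
            output.append(
                {"recordId": record["recordId"], "result": "Ok", "data": encoded}
            )
        except (ValueError, KeyError, TypeError):
            # Malformed records are returned unchanged and flagged as failed.
            output.append(
                {
                    "recordId": record["recordId"],
                    "result": "ProcessingFailed",
                    "data": record["data"],
                }
            )
    return {"records": output}
```

Records marked `ProcessingFailed` are not dropped silently by the handler; how Firehose handles them afterwards depends on the delivery stream configuration.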

Demo - Architecture
Stages: Transform, Analyze & Consume
Components: Device 1 through Device 100, Kinesis Firehose

Amazon S3 - Infinite, Durable & Cost Effective Storage
• Unmatched durability, availability, and scalability
• Best security, compliance, and audit capability
• Object-level control at any scale
• Business insight into your data
• Twice as many partner integrations
• Most ways to bring data in

Demo - Architecture
Stages: Transform, Analyze & Consume
Components: Device 1 through Device 100, Kinesis Firehose, Raw Data S3 Bucket (raw)

Data Lake Metadata Management
Discover, Catalog & ETL

AWS Glue - data catalog
Make data discoverable
Automatically discovers data and stores schema
Catalog makes data searchable, and available for ETL
Catalog contains table and job definitions
Computes statistics to make queries efficient
Compliance
Glue Data Catalog: Discover data and extract schema
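
To give a feel for what "discover data and extract schema" means, here is a toy schema inference over newline-delimited JSON. It is only an illustration of the idea and not how Glue crawlers or classifiers are implemented; the type names, the widening rule, and the sample records are all assumptions.

```python
import json
from typing import Dict, Iterable

# Assumed mapping from Python types to catalog-style type names.
TYPE_NAMES = {bool: "boolean", int: "bigint", float: "double", str: "string"}


def infer_schema(lines: Iterable[str]) -> Dict[str, str]:
    """Infer a column-to-type mapping from JSON lines (toy example)."""
    schema: Dict[str, str] = {}
    for line in lines:
        for column, value in json.loads(line).items():
            # Anything unrecognized (null, nested values) falls back to string.
            observed = TYPE_NAMES.get(type(value), "string")
            previous = schema.setdefault(column, observed)
            if previous != observed:
                # Widen bigint/double conflicts to double, everything else to string.
                if {previous, observed} == {"bigint", "double"}:
                    schema[column] = "double"
                else:
                    schema[column] = "string"
    return schema


# Hypothetical raw device records.
sample = [
    '{"device_id": "device-1", "temperature": 21, "online": true}',
    '{"device_id": "device-2", "temperature": 22.5, "online": false}',
]
schema = infer_schema(sample)
```

Here `temperature` appears first as an integer and then as a float, so the sketch widens the column to `double` rather than failing.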

AWS Glue - ETL service
Make ETL scripting and deployment easy
Serverless Transformations
Based on Apache Spark
Automatically generates ETL code
Code is customizable with PySpark and Scala
Endpoints provided to edit, debug, test code
Jobs are scheduled or event-based

Demo - Architecture
Stages: Transform, Analyze & Consume
Components: Device 1 through Device 100, Kinesis Firehose, Raw Data S3 Bucket (raw), Glue Crawler, Glue Catalog, Glue ETL, Processed S3 Bucket

Amazon EMR - Big Data Processing
Fully managed – Hadoop Framework
19 Apps: Hadoop, Hive, Spark, HBase, Presto, and more
S3 Integration – Decouple Compute and Storage
Low Cost – Transient Clusters, Per Second Pricing, Spot Instances
Amazon Redshift - Modern Data Warehousing
Fast, scalable, fully managed EDW at 1/10th the cost of other EDWs
Massively parallel, scales from gigabytes to exabytes
Access data across your Redshift DW and Amazon S3 data lake

Amazon Athena - Interactive Analysis
Interactive query service to analyze data in Amazon S3 using standard SQL
No infrastructure to set up or manage and no data to load
• Query Instantly: Zero setup cost; just point to S3 and start querying
• Pay per query: Pay only for queries run; save 30–90% on per-query costs through compression
• Open: ANSI SQL interface, JDBC/ODBC drivers, multiple formats, compression types, and complex joins and data types
• Easy: Serverless, with zero infrastructure and zero administration; integrated with QuickSight
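
The pay-per-query point can be made concrete with a little arithmetic. The sketch below assumes a price of USD 5 per TB scanned, a 10 MB minimum charge per query, and binary terabytes; these figures are assumptions for illustration and should be checked against current Athena pricing for your region. The 70% compression ratio is a hypothetical value inside the 30–90% range quoted on the slide.

```python
# Assumed pricing parameters (verify against the current price list).
PRICE_PER_TB_USD = 5.00
BYTES_PER_TB = 1024 ** 4
MIN_BILLABLE_BYTES = 10 * 1024 ** 2


def query_cost_usd(bytes_scanned: int) -> float:
    """Estimate the cost of one query from the bytes it scans."""
    billable = max(bytes_scanned, MIN_BILLABLE_BYTES)
    return billable / BYTES_PER_TB * PRICE_PER_TB_USD


# Hypothetical query that scans 1 TB of uncompressed data.
raw_scan = 1 * BYTES_PER_TB
# The same data after a hypothetical 70% size reduction from compression.
compressed_scan = raw_scan * 3 // 10

raw_cost = query_cost_usd(raw_scan)
compressed_cost = query_cost_usd(compressed_scan)
```

Because the charge is proportional to bytes scanned, anything that shrinks the scan (compression, columnar formats, partition filters) lowers the bill by the same proportion.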

Demo - Architecture
Stages: Transform, Analyze & Consume
Components: Device 1 through Device 100, Kinesis Firehose, Raw Data S3 Bucket (raw), Glue Crawler, Glue Catalog, Glue ETL, Processed S3 Bucket, Athena, QuickSight

Daniel Muller
Head of Cloud Infrastructure, Spuul
danielmullerch

• Leading OTT player
• Indian Movies, Shows and Live TV
• 50 million registered users
• Users in 5 continents
• Content served on Mobile, connected TVs, Set-Top Boxes
• (Coming Soon..!) Non-Indian content

Why we built a serverless data lake?
• 100+ event types - across microservices & devices
• Flexibility - Ingestion, Consumption
• Bottomless storage - Cheap & Reliable
• Ad-hoc querying - Analyze data without a data warehouse
• Future use cases - 3rd party integrations
• We hate managing servers..! #NoMoreServers

Lessons Learnt & Best Practices
• Use a framework - SAM, Serverless, Apex/Up
• Store data in raw format - debugging and re-processing
• Convert to Columnar Formats - Optimized for reads
• Partition data - Based on your filters
• Specify columns to load - Reduce data transfer
• Create files of ~100MB - Reduces S3 list API calls
• Compress Data in Lake - Reduce network transfers
• Use Lambda for Automation - Wire things together
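
Two of these practices, partitioning by filter columns and aiming for files of roughly 100 MB, can be sketched in a few lines. The `processed/` prefix, the `event_type` partition column, and the greedy grouping strategy are hypothetical choices for illustration; they are not described in the talk.

```python
from datetime import datetime, timezone
from typing import List

TARGET_FILE_BYTES = 100 * 1024 ** 2  # roughly 100 MB per output file


def partition_prefix(event_type: str, ts: datetime) -> str:
    """Build a Hive-style S3 key prefix partitioned by event type and date."""
    return (
        f"processed/event_type={event_type}/"
        f"year={ts:%Y}/month={ts:%m}/day={ts:%d}/"
    )


def plan_compaction(
    file_sizes: List[int], target: int = TARGET_FILE_BYTES
) -> List[List[int]]:
    """Greedily group small files so each group stays at or below the target size."""
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for size in file_sizes:
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


MB = 1024 ** 2
prefix = partition_prefix("play", datetime(2018, 4, 12, tzinfo=timezone.utc))
plan = plan_compaction([40 * MB, 40 * MB, 40 * MB, 90 * MB, 5 * MB])
```

With `key=value` path segments, a query that filters on `event_type` and date only needs to read the matching prefixes, and each compaction group can be rewritten as a single larger object.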

Demo - Architecture
Stages: Transform, Analyze & Consume
Components: Device 1 through Device 100, Kinesis Firehose, Raw Data S3 Bucket (raw), Glue Crawler, Glue Catalog, Glue ETL, Processed S3 Bucket, Athena, QuickSight

Take the demo home…
http://bit.ly/sg-summit-datalake-demo

Central Storage - Secure, Cost Effective Storage in S3
• S3
Data Ingestion - Get your data into S3 quickly and securely
• Kinesis, Direct Connect, Snowball, DMS
Processing & Analytics - Use predictive and prescriptive analytics to gain better understanding
• Athena, QuickSight, EMR, Redshift, Glue ETL
Protect & Secure - Use entitlements to ensure data is secure and users' identities are verified
• Security Token Service, CloudWatch, CloudTrail, KMS
Catalog & Search - Access & Search Metadata
• DynamoDB, Amazon ES, Glue Catalog
Access & User Interface - Give your users easy & secure access
• API Gateway, IAM, Cognito

Thank You!
unni_k_pillai
Survey & Feedback
#AWSSummitSG
#NoMoreServers
#DataLake
