Unni Pillai, Specialist Solution Architect, ASEAN, AWS.
Daniel Muller, Head of Cloud Infrastructure, Spuul.
As the volume and variety of data continue to grow, customers often have valuable data that is not easily discoverable or available for analytics. A common challenge for data engineering teams is architecting a data lake that can cater to the needs of diverse users - from developers to business analysts to data scientists.
In this session, we will dive deep into building a data lake using Amazon S3, Amazon Kinesis, Amazon Athena and AWS Glue. We will also see how AWS Glue crawlers can automatically discover your data, extracting and cataloguing relevant metadata to reduce the operational work of preparing your data for downstream consumers.
Furthermore, learn from our customer Spuul how they moved from data-warehouse-based analytics to a serverless data lake. Why and how did Spuul undertake this journey? Hear about the benefits they gained and the challenges they encountered.
2. What is a Data Lake?
A data lake is an architectural approach that allows you to store massive amounts of data in a central location, so it's readily available to be categorized, processed, analyzed and consumed by diverse groups of users within an organization.
[Diagram: OLTP, ERP, CRM and LOB systems feed a Data Warehouse used for Business Intelligence. Devices, Web, Sensors and Social sources, alongside those systems, feed a Data Lake with a Catalog, which supports Machine Learning, DW Queries, Big data processing, and Interactive and Real-time analytics.]
3. Challenges faced by data teams
Exponential growth in data - Transactions, ERP, Sensor Data, Billing, Web Logs, Social, Infrastructure Logs
Diversified consumers - Data Scientists, Business Analysts, External Consumers, Applications
Multiple access mechanisms - API Access, BI Tools, Notebooks
4. Characteristics of a data lake
Future Proof
Flexible Access
Dive in Anywhere
Collect Anything
5. Let’s take an example
Record-level data, Device Data
Design Outcomes
Serverless
Ingest & store data in real-time
Discover and catalog data stored in the lake
Enable batch & real-time processing
Consume raw & processed data
Scalable, Highly Available, Pay for what you use
8. Multiple data lake ingestion methods
AWS Snowball and AWS Snowmobile
• PB-scale migration
AWS Storage Gateway
• Migrate legacy files
Native/ISV Connectors
• Ecosystem integration
Amazon S3 Transfer Acceleration
• Long-distance data transfer
AWS Direct Connect
• On-premises integration
Amazon Kinesis Firehose
• Ingest device streams
• Transform and store on Amazon S3
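The device-stream ingestion path above could be sketched roughly as follows. This is a minimal illustration, not the code used in the demo: the delivery stream name, the shape of a device reading, and the helper names are all hypothetical, and the two batch limits are assumptions that should be checked against the current Firehose quotas. The `client` argument is expected to be a `boto3.client("firehose")` object.

```python
import json
from typing import Iterable, Iterator, List

# Assumed PutRecordBatch limits; verify against the current Firehose quotas.
MAX_RECORDS_PER_BATCH = 500
MAX_BYTES_PER_BATCH = 4 * 1024 * 1024


def to_firehose_record(reading: dict) -> dict:
    """Serialize one (hypothetical) device reading as newline-delimited JSON."""
    line = json.dumps(reading, sort_keys=True) + "\n"
    return {"Data": line.encode("utf-8")}


def batch_records(records: Iterable[dict]) -> Iterator[List[dict]]:
    """Group records so each batch stays within the assumed count and size limits."""
    batch: List[dict] = []
    size = 0
    for record in records:
        record_size = len(record["Data"])
        too_many = len(batch) >= MAX_RECORDS_PER_BATCH
        too_big = size + record_size > MAX_BYTES_PER_BATCH
        if batch and (too_many or too_big):
            yield batch
            batch, size = [], 0
        batch.append(record)
        size += record_size
    if batch:
        yield batch


def send(readings: Iterable[dict], stream_name: str, client) -> int:
    """Send readings through a Firehose client and return the number of calls made."""
    calls = 0
    for batch in batch_records(to_firehose_record(r) for r in readings):
        client.put_record_batch(DeliveryStreamName=stream_name, Records=batch)
        calls += 1
    return calls
```

A production sender would also inspect `FailedPutCount` in each response and retry the failed records; that is left out here to keep the sketch short.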
9. Amazon Kinesis - real-time analytics
Easily collect, process, and analyze video and data streams in real time
Kinesis Video Streams - Capture, process, and store video streams for analytics
Kinesis Data Streams - Build custom applications that analyze data streams
Kinesis Data Firehose - Load data streams into AWS data stores
Kinesis Data Analytics - Analyze data streams with SQL
10. Serverless data delivery with Kinesis Firehose
Data Producers
Firehose - Inline Transform with AWS Lambda
Destinations - Amazon S3, Amazon Redshift, Amazon Elasticsearch
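An inline transform of the kind shown on this slide might look like the sketch below. It assumes the Firehose data-transformation contract (each incoming record carries a `recordId` and base64-encoded `data`, and each returned record carries the same `recordId`, a `result`, and base64-encoded `data`). The payload fields `temperature_c` and `temperature_f` are hypothetical and only illustrate an enrichment step.

```python
import base64
import json


def lambda_handler(event, context):
    """Enrich each JSON record; mark records that cannot be processed as failed."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            # Hypothetical enrichment: add a Fahrenheit reading next to Celsius.
            payload["temperature_f"] = payload["temperature_c"] * 9 / 5 + 32
            data = (json.dumps(payload) + "\n").encode("utf-8")
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(data).decode("utf-8"),
            })
        except (ValueError, KeyError, TypeError):
            # Return the original data unchanged and flag the record as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Appending a newline to each transformed record keeps the objects written to S3 line-delimited, which tends to make them easier to crawl and query later.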
13. Amazon S3 - Infinite, Durable & Cost Effective Storage
Unmatched durability, availability, and scalability
Best security, compliance, and audit capability
Object-level control at any scale
Business insight into your data
Twice as many partner integrations
Most ways to bring data in
14. Demo - Architecture
Stages: Data Source, Ingest & Store, Catalog & Transform, Analyze & Consume
Data Source - Device 1 ... Device 100
Ingest & Store - Kinesis Firehose, Raw Data S3 Bucket (raw)
16. AWS Glue—data catalog
Make data discoverable
Automatically discovers data and stores schema
Catalog makes data searchable, and available for ETL
Catalog contains table and job definitions
Computes statistics to make queries efficient
Compliance
Glue Data Catalog - Discover data and extract schema
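To make "discover data and extract schema" concrete, here is a toy schema inference over line-delimited JSON. It is purely illustrative and is not how a Glue crawler is implemented: the type names and the merge rule (widen integer to double, otherwise fall back to string) are assumptions chosen for the example.

```python
import json
from typing import Dict, Iterable


def type_name(value) -> str:
    """Map a parsed JSON value to an illustrative column type name."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "struct"
    return "string"


def merge(current: str, observed: str) -> str:
    """Reconcile two observed types for the same column."""
    if current == observed:
        return current
    if {current, observed} == {"bigint", "double"}:
        return "double"
    return "string"


def infer_schema(lines: Iterable[str]) -> Dict[str, str]:
    """Scan JSON lines and return a column-name to type mapping."""
    schema: Dict[str, str] = {}
    for line in lines:
        for key, value in json.loads(line).items():
            if value is None:
                continue  # nulls carry no type information
            observed = type_name(value)
            schema[key] = merge(schema[key], observed) if key in schema else observed
    return schema
```

The point of the sketch is only that a schema can be derived from the data itself, so downstream consumers do not have to declare tables by hand.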
17. AWS Glue—ETL service
Make ETL scripting and deployment easy
Serverless Transformations
Based on Apache Spark
Automatically generates ETL code
Code is customizable with PySpark and Scala
Endpoints provided to edit, debug, test code
Jobs are scheduled or event-based
18. Demo - Architecture
Stages: Data Source, Ingest & Store, Catalog & Transform, Analyze & Consume
Data Source - Device 1 ... Device 100
Ingest & Store - Kinesis Firehose, Raw Data S3 Bucket (raw)
Catalog & Transform - Glue Crawler, Glue Catalog, Glue ETL, Processed S3 Bucket
20. Amazon EMR - Big Data Processing
Fully managed – Hadoop Framework
19 Apps: Hadoop, Hive, Spark, HBase, Presto, and more
S3 Integration – Decouple Compute and Storage
Low Cost – Transient Clusters, Per Second Pricing, Spot Instances
Amazon Redshift - Modern Data Warehousing
Fast, scalable, fully managed EDW at 1/10th the cost of other EDWs
Massively parallel, scales from gigabytes to exabytes
Access data across your Redshift DW and Amazon S3 data lake
21. Amazon Athena—Interactive Analysis
Interactive query service to analyze data in Amazon S3 using standard SQL
No infrastructure to set up or manage and no data to load
Query Instantly - Zero setup cost; just point to S3 and start querying
Pay per query - Pay only for queries run; save 30–90% on per-query costs through compression
Open - ANSI SQL interface, JDBC/ODBC drivers, multiple formats, compression types, and complex joins and data types
Easy - Serverless: zero infrastructure, zero administration
Integrated with QuickSight
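The pay-per-query point can be worked through with simple arithmetic. The sketch below assumes billing is proportional to bytes scanned; the price of 5.0 USD per TB and the use of 1024^4 bytes per TB are assumptions for illustration only, so check current regional pricing before relying on the numbers.

```python
ASSUMED_PRICE_PER_TB_USD = 5.0   # illustrative price, not a quoted rate
BYTES_PER_TB = 1024 ** 4         # assumed definition of a terabyte


def query_cost_usd(bytes_scanned: int,
                   price_per_tb_usd: float = ASSUMED_PRICE_PER_TB_USD) -> float:
    """Estimate the cost of one query from the bytes it scans."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb_usd


def savings_fraction(raw_bytes: int, compressed_bytes: int) -> float:
    """Fraction of scan cost saved when the same data is stored compressed."""
    return 1 - compressed_bytes / raw_bytes


raw = BYTES_PER_TB              # a query that scans 1 TB of raw data
compressed = BYTES_PER_TB // 4  # the same data at a hypothetical 4:1 ratio

print(query_cost_usd(raw))                # 5.0
print(query_cost_usd(compressed))         # 1.25
print(savings_fraction(raw, compressed))  # 0.75
```

With a 4:1 compression ratio the scan cost drops by 75%, which sits inside the 30–90% range quoted on the slide; the actual ratio depends on the data and the codec.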
22. Demo - Architecture
Stages: Data Source, Ingest & Store, Catalog & Transform, Analyze & Consume
Data Source - Device 1 ... Device 100
Ingest & Store - Kinesis Firehose, Raw Data S3 Bucket (raw)
Catalog & Transform - Glue Crawler, Glue Catalog, Glue ETL, Processed S3 Bucket
Analyze & Consume - Athena, QuickSight
24. Spuul
• Leading OTT player
• Indian Movies, Shows and Live TV
• 50 million registered users
• Users in 5 continents
• Content served on mobile, connected TVs and set-top boxes
• (Coming Soon..!) Non-Indian content
25. Why we built a serverless data lake?
• 100+ event types - across microservices & devices
• Flexibility - Ingestion, Consumption
• Bottomless storage - Cheap & Reliable
• Ad-hoc querying – Analyze data without a data-warehouse
• Future use cases - 3rd party integrations
• We hate managing servers..! #NoMoreServers
27. Lessons Learnt & Best Practices
Use a framework - SAM, Serverless, Apex/Up
Store data in raw format - debugging and re-processing
Convert to Columnar Formats - Optimized for reads
Partition data - Based on your filters
Specify columns to load - Reduce data transfer
Create files of ~100MB - Reduces S3 list API calls
Compress Data in Lake – Reduce network transfers
Use Lambda for Automation – Wire things together
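Two of the practices above, partitioning by the fields you filter on and selecting only the columns you need, can be sketched as follows. The Hive-style `key=value` prefix layout is a common convention rather than something the slides prescribe, and the table name, column names and partition keys are hypothetical.

```python
from datetime import datetime, timezone
from typing import Dict, List


def partitioned_key(prefix: str, event_type: str, ts: datetime, filename: str) -> str:
    """Build an S3 object key with Hive-style partitions (key=value folders)."""
    return (
        f"{prefix}/event_type={event_type}/"
        f"year={ts:%Y}/month={ts:%m}/day={ts:%d}/{filename}"
    )


def build_query(table: str, columns: List[str], partitions: Dict[str, str]) -> str:
    """Select only the needed columns and filter on partition keys.

    Values are interpolated directly, so this is only suitable for trusted input.
    """
    where = " AND ".join(f"{key} = '{value}'" for key, value in partitions.items())
    return f"SELECT {', '.join(columns)} FROM {table} WHERE {where}"


ts = datetime(2018, 4, 12, tzinfo=timezone.utc)
print(partitioned_key("processed", "play", ts, "part-0000.parquet"))
print(build_query("events", ["user_id", "title"], {"year": "2018", "month": "04"}))
```

Filtering on partition keys lets the query engine skip whole prefixes, and naming columns explicitly lets a columnar format skip the columns that are not read.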
29. Demo - Architecture
Stages: Data Source, Ingest & Store, Catalog & Transform, Analyze & Consume
Data Source - Device 1 ... Device 100
Ingest & Store - Kinesis Firehose, Raw Data S3 Bucket (raw)
Catalog & Transform - Glue Crawler, Glue Catalog, Glue ETL, Processed S3 Bucket
Analyze & Consume - Athena, QuickSight
30. Take the demo home…
http://bit.ly/sg-summit-datalake-demo
31. Central Storage - Secure, Cost Effective Storage in S3 (S3)
Data Ingestion - Get your data into S3 quickly and securely (Kinesis, Direct Connect, Snowball, DMS)
Processing & Analytics - Use predictive and prescriptive analytics to gain better understanding (Athena, QuickSight, EMR, Redshift, Glue ETL)
Protect & Secure - Use entitlements to ensure data is secure and users' identities are verified (Security Token Service, CloudWatch, CloudTrail, KMS)
Catalog & Search - Access & Search Metadata (DynamoDB, Amazon ES, Glue Catalog)
Access & User Interface - Give your users easy & secure access (API Gateway, IAM, Cognito)
32. Thank You!
unni_k_pillai
Survey & Feedback
#AWSSummitSG
#NoMoreServers
#DataLake