© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Unni Pillai
Specialist Solutions Architect, Amazon Web Services
Daniel Muller
Head of Cloud Infrastructure, Spuul
Architecting a Serverless Data Lake on AWS
What is a Data Lake?
A data lake is an architectural approach that allows you to store massive amounts of data in a central location, so it's readily available to be categorized, processed, analyzed and consumed by a diverse group of users within an organization.
OLTP, ERP, CRM, LOB
Data Warehouse
Business Intelligence
Data Lake
Devices, Web, Sensors, Social
Catalog
Machine Learning
DW Queries
Big data processing
Interactive
Real-time
Challenges faced by data teams
Transactions
ERP
Sensor Data
Billing
Web Logs
Social
Infrastructure Logs
Exponential growth in data
Diversified consumers
Data Scientists
Business Analyst External Consumers
Applications
Multiple access mechanisms
API Access
BI Tools
Notebooks
Characteristics of a data lake
Future Proof
Flexible Access
Dive in Anywhere
Collect Anything
Let’s take an example
Record-level data
Device Data
Design Outcomes
Serverless
Ingest & store data in real time
Discover and catalog data stored in the lake
Enable batch & real-time processing
Consume raw & processed data
Scalable, Highly Available, Pay for what you use
Demo - Architecture
Data Source | Ingest & Store | Catalog & Transform | Analyze & Consume
Device 1 ... Device 100
Data Lake Ingestion
Multiple data lake ingestion methods
AWS Snowball and AWS Snowmobile
• PB-scale migration
AWS Storage Gateway
• Migrate legacy files
Native/ISV Connectors
• Ecosystem integration
Amazon S3 Transfer Acceleration
• Long-distance data transfer
AWS Direct Connect
• On-premises integration
Amazon Kinesis Firehose
• Ingest device streams
• Transform and store on Amazon S3
Amazon Kinesis - real-time analytics
Easily collect, process, and analyze video and data streams in real time
• Kinesis Video Streams: Capture, process, and store video streams for analytics
• Kinesis Data Streams: Build custom applications that analyze data streams
• Kinesis Data Firehose: Load data streams into AWS data stores
• Kinesis Data Analytics: Analyze data streams with SQL
Serverless data delivery with Kinesis Firehose
Data Producers
Firehose
AWS Lambda: Inline Transform
Destinations: Amazon S3, Amazon Redshift, Amazon Elasticsearch
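The inline transform step on this slide can be sketched as a small Lambda handler. This is a minimal illustration, not the function used in the demo: it assumes each device sends one JSON object per record, and the `schema_version` field is a hypothetical enrichment. The shape of the return value follows the contract Firehose expects from a transformation Lambda: every incoming `recordId` is returned with a `result` of `Ok`, `Dropped`, or `ProcessingFailed`, plus base64-encoded `data`.

```python
import base64
import json


def lambda_handler(event, context):
    """Transform each Firehose record and hand it back for delivery."""
    output = []
    for record in event["records"]:
        try:
            # Firehose delivers the payload base64-encoded.
            payload = json.loads(base64.b64decode(record["data"]))
            # Hypothetical enrichment: tag each reading with a schema version.
            payload["schema_version"] = 1
            # Append a newline so each line of the delivered S3 object is one JSON document.
            line = json.dumps(payload) + "\n"
            data = base64.b64encode(line.encode("utf-8")).decode("utf-8")
            output.append(
                {"recordId": record["recordId"], "result": "Ok", "data": data}
            )
        except (ValueError, TypeError):
            # Undecodable or non-object payloads are returned unchanged and flagged as failed.
            output.append(
                {
                    "recordId": record["recordId"],
                    "result": "ProcessingFailed",
                    "data": record["data"],
                }
            )
    return {"records": output}
```

Returning failed records instead of raising keeps one malformed device message from failing the whole batch.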
Demo - Architecture
Data Source | Ingest & Store | Catalog & Transform | Analyze & Consume
Kinesis Firehose
Device 1 ... Device 100
Data Lake Storage
Amazon S3 - Infinite, Durable & Cost Effective Storage
• Unmatched durability, availability, and scalability
• Best security, compliance, and audit capability
• Object-level control at any scale
• Business insight into your data
• Twice as many partner integrations
• Most ways to bring data in
Demo - Architecture
Data Source | Ingest & Store | Catalog & Transform | Analyze & Consume
Kinesis Firehose
Raw Data S3 Bucket (raw)
Device 1 ... Device 100
Data Lake Metadata Management
Discover, Catalog & ETL
AWS Glue—data catalog
Make data discoverable
Automatically discovers data and stores schema
Catalog makes data searchable, and available for ETL
Catalog contains table and job definitions
Computes statistics to make queries efficient
Compliance
Glue Data Catalog
Discover data and extract schema
AWS Glue—ETL service
Make ETL scripting and deployment easy
Serverless Transformations
Based on Apache Spark
Automatically generates ETL code
Code is customizable with PySpark and Scala
Endpoints provided to edit, debug, test code
Jobs are scheduled or event-based
Demo - Architecture
Data Source | Ingest & Store | Catalog & Transform | Analyze & Consume
Processed S3 Bucket
Kinesis Firehose
Raw Data S3 Bucket (raw)
Glue Crawler
Glue Catalog
Device 1 ... Device 100
Glue ETL
Data Lake Analytics
Amazon EMR - Big Data Processing
Fully managed – Hadoop Framework
19 Apps: Hadoop, Hive, Spark, HBase, Presto, and more
S3 Integration – Decouple Compute and Storage
Low Cost – Transient Clusters, Per Second Pricing, Spot Instances
Amazon Redshift - Modern Data Warehousing
Fast, scalable, fully managed EDW at 1/10th the cost of other EDWs
Massively parallel, scales from gigabytes to exabytes
Access data across your Redshift DW and Amazon S3 data lake
Amazon Athena—Interactive Analysis
• Query Instantly: Zero setup cost; just point to S3 and start querying
• Pay per query: Pay only for queries run; save 30–90% on per-query costs through compression
• Open: ANSI SQL interface, JDBC/ODBC drivers, multiple formats, compression types, and complex joins and data types
• Easy: Serverless: zero infrastructure, zero administration; integrated with QuickSight
Interactive query service to analyze data in Amazon S3 using standard SQL
No infrastructure to set up or manage and no data to load
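As a rough sketch of what "point to S3 and start querying" looks like from code, the helper below builds the arguments for the Athena `start_query_execution` API call. The database, table, column and bucket names are hypothetical placeholders, not taken from the demo, and the simple string interpolation assumes trusted, hard-coded values only. Naming columns explicitly and filtering on partition columns keeps the amount of data scanned small.

```python
def build_athena_request(database, table, columns, partition_filter, output_location):
    """Build keyword arguments for Athena's start_query_execution call.

    All names passed in are illustrative; values are interpolated directly,
    so this sketch is only suitable for trusted, hard-coded inputs.
    """
    # Select only the columns that are needed instead of SELECT *.
    query = f"SELECT {', '.join(columns)} FROM {table}"
    if partition_filter:
        # Restrict the query to specific partitions, e.g. year and month.
        predicates = " AND ".join(
            f"{col} = '{val}'" for col, val in partition_filter.items()
        )
        query += f" WHERE {predicates}"
    return {
        "QueryString": query,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


request = build_athena_request(
    database="datalake_demo",              # hypothetical Glue database
    table="device_events",                 # hypothetical table created by a crawler
    columns=["device_id", "temperature"],  # hypothetical columns
    partition_filter={"year": "2018", "month": "04"},
    output_location="s3://example-athena-results/",  # hypothetical bucket
)
# With boto3 installed and credentials configured, the request could be sent as:
#   boto3.client("athena").start_query_execution(**request)
```

Keeping the request builder separate from the API call makes the query text easy to inspect before anything is run.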
Demo - Architecture
Data Source | Ingest & Store | Catalog & Transform | Analyze & Consume
Processed S3 Bucket
Glue Catalog
Kinesis Firehose
Raw Data S3 Bucket (raw)
Glue Crawler
Athena
QuickSight
Device 1 ... Device 100
Glue ETL
Daniel Muller
Head of Cloud Infrastructure, Spuul
danielmullerch
• Leading OTT player
• Indian Movies, Shows and Live TV
• 50 million registered users
• Users in 5 continents
• Content served on mobile, connected TVs, set-top boxes
• (Coming Soon..!) Non-Indian content
Why we built a serverless data lake?
• 100+ event types - across microservices & devices
• Flexibility - Ingestion, Consumption
• Bottomless storage - Cheap & Reliable
• Ad-hoc querying – Analyze data without a data-warehouse
• Future use cases - 3rd party integrations
• We hate managing servers..! #NoMoreServers
Architecture
Lessons Learnt & Best Practices
Use a framework - SAM, Serverless, Apex/Up
Store data in raw format - debugging and re-processing
Convert to Columnar Formats - Optimized for reads
Partition data - Based on your filters
Specify columns to load - Reduce data transfer
Create files of ~100MB - Reduces S3 list API calls
Compress Data in Lake – Reduce network transfers
Use Lambda for Automation – Wire things together
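To make the "Partition data - Based on your filters" advice concrete, here is a small sketch of building Hive-style `key=value` object keys, the layout that Glue crawlers and Athena recognize as partitions. The prefix, the `event_type` partition and the file name are hypothetical; the right partition columns depend on which filters your own queries use most.

```python
from datetime import datetime, timezone


def partitioned_key(prefix, event_type, ts, filename):
    """Return an S3 object key laid out in Hive-style partitions.

    The partition columns (event_type, year, month, day) are an assumption
    for illustration; choose them to match your most common query filters.
    """
    return (
        f"{prefix}/event_type={event_type}"
        f"/year={ts:%Y}/month={ts:%m}/day={ts:%d}/{filename}"
    )


key = partitioned_key(
    "processed",                                  # hypothetical prefix
    "play",                                       # hypothetical event type
    datetime(2018, 4, 5, tzinfo=timezone.utc),
    "part-0000.parquet",                          # hypothetical columnar file
)
print(key)
```

A query that filters on `event_type`, `year`, `month` or `day` then only has to read the objects under the matching prefixes.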
Summary
Demo - Architecture
Data Source | Ingest & Store | Catalog & Transform | Analyze & Consume
Processed S3 Bucket
Kinesis Firehose
Raw Data S3 Bucket (raw)
Athena
QuickSight
Device 1 ... Device 100
Glue Catalog
Glue Crawler
Glue ETL
Take the demo home…
http://bit.ly/sg-summit-datalake-demo
Central Storage
Secure, Cost Effective Storage in S3
• S3
Data Ingestion
Get your data into S3 quickly and securely
• Kinesis, Direct Connect, Snowball, DMS
Processing & Analytics
Use predictive and prescriptive analytics to gain better understanding
• Athena, QuickSight, EMR, Redshift, Glue ETL
Protect & Secure
Use entitlements to ensure data is secure and users' identities are verified
• Security Token Service, CloudWatch, CloudTrail, KMS
Catalog & Search
Access & Search Metadata
• DynamoDB, Amazon ES, Glue Catalog
Access & User Interface
Give your users easy & secure access
• API Gateway, IAM, Cognito
Thank You !
unni_k_pillai
Take the demo home
http://bit.ly/sg-summit-datalake-demo
Survey & Feedback
#AWSSummitSG
#NoMoreServers
#DataLake

AWS Summit Singapore - Architecting a Serverless Data Lake on AWS
