Owning Your Own (Data) Lake House
Antje Barth
Sr. Developer Advocate AI/ML
Amazon Web Services
@anbarth
https://www.datascienceonaws.com/
About me
Sr. Developer Advocate for AI and Machine Learning
Co-author of the O'Reilly book "Data Science on AWS"
Co-founder of the Düsseldorf chapter of Women in Big Data
https://www.datascienceonaws.com/
@anbarth
Antje Barth
Agenda
• Data trends
• Building your (data) lake house
• Demo: Amazon Redshift Spectrum
© 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Data trends
• Growing exponentially
• From new sources
• Increasingly diverse
• Used by many people
• Analyzed by many applications
Data silos to
OLTP ERP CRM LOB
DW Silo 1
Business Intelligence
Devices Web Sensors Social
DW Silo 2
Business Intelligence
Machine learning
BI + analytics
Data warehousing
Data lakes
Open formats
Central catalog
Traditional data warehousing approaches don’t scale
Companies moving to data lake architectures
Bringing together the best of both worlds
Extends or evolves DW architectures
Store any data in any format
Durable, available, and exabyte scale
Secure, compliant, auditable
Run any type of analytics from DW to Predictive
Data Warehousing
Analytics
Machine Learning
Data lake
The new data analytics stack
Migration & Streaming Services
Infrastructure
Data Catalog & ETL
Security & Management
Data Warehousing
Big Data Processing
Interactive Query
Operational Analytics
Real-time Analytics
Serverless Data processing
Data movement
Analytics
Data lake infrastructure & management
Dashboards
Predictive Analytics
Data, visualization, engagement, & machine learning
Digital User Engagement
Data
The AWS data analytics portfolio
Data, visualization, engagement, & machine learning: QuickSight, SageMaker, Comprehend, Lex, Polly, Rekognition, Translate, Pinpoint, Data Exchange
Analytics: Redshift, EMR (Spark & Hadoop), Athena, Elasticsearch Service, Kinesis Data Analytics, AWS Glue (Spark & Python)
Data lake infrastructure & management: S3/Glacier, AWS Glue, Lake Formation
Data movement: Database Migration Service | Snowball | Snowmobile | Kinesis Data Streams | Kinesis Data Firehose | Managed Streaming for Apache Kafka
+ many more
Open standards, formats, and Apache open source
Flink
Ganglia
HBase
HCatalog
HDFS
Hive
Hudi
Java
JupyterHub
Kafka
Livy
Mahout
MapReduce
MXNet
MySQL
Oozie
ORC
Parquet
Phoenix
Pig
Presto
Python
PyTorch
R
Scala
Spark
Sqoop
SQL
TensorFlow
Tez
YARN
Zeppelin
ZooKeeper
Data warehousing: Amazon Redshift
First and most popular cloud data warehouse
Best performance, most scalable
• 3x faster with RA3*
• 10x faster with AQUA*
• Adds unlimited compute capacity on demand to meet unlimited concurrent access
Lowest cost
• Cost-optimized workloads by paying for compute and storage separately
• 1/10th the cost of a traditional DW at $1,000/TB/year
• Up to 75% less than other cloud data warehouses & predictable costs
Data lake & AWS integration
• Analyze exabytes of data across data warehouse, data lakes, and operational databases
• Query data across various analytics services
Most secure & compliant
• AWS-grade security (e.g. VPC, encryption with KMS, CloudTrail)
• All major certifications such as SOC, PCI DSS, ISO, FedRAMP, HIPAA
*vs. other cloud DWs
Redshift enables you to have a lake house approach
Companies moving to data lake architectures
Data warehouse (business data)
Data lake (event data)
Redshift
• Have one foot in a data warehouse, and one foot in a data lake
• Store highly structured, frequently accessed data in Redshift
• Keep exabytes of structured and unstructured data in S3
• Query seamlessly across both to provide unique insights that you
would not be able to obtain by querying independent datasets
Without moving or transforming data!
Redshift Lake House
Redshift Lake House Architecture
Powered by the following capabilities:
• Amazon Redshift Spectrum: Query open format data directly in the Amazon S3 data lake without having to load the data or duplicate your infrastructure.
• Data Lake Export: Save the results of an Amazon Redshift query directly to your S3 data lake in an open file format (Apache Parquet).
• Federated Query: Query data directly in Amazon RDS and Aurora PostgreSQL stores from Amazon Redshift.
Redshift Spectrum
Redshift Spectrum is a feature of Redshift that allows Redshift SQL queries to reference external data on Amazon S3 as they would any other table in Amazon Redshift.
• Allows for querying of potentially exabytes of data in an S3 data lake from within Amazon Redshift
• Data is queried in place, so no loading of data into your Redshift cluster is required
• Keeps your data warehouse lean by ingesting warm data locally while keeping other data in the data lake within reach
• Write query results from Redshift directly to S3 external tables
• Powered by a separate fleet of powerful Amazon Redshift Spectrum nodes
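To make "reference external data on Amazon S3 as they would any other table" more concrete, here is a minimal sketch that renders the kind of SQL involved: an external schema backed by the Glue Data Catalog, an external Parquet table, and a join against a local table. Every object name (`spectrum`, `lakedb`, `clicks`, `public.customers`), the column list, the IAM role ARN, and the S3 location are hypothetical placeholders, not names from the talk. The helper only builds strings and does not connect to a cluster.

```python
from typing import List


def spectrum_setup_statements(iam_role_arn: str, s3_location: str) -> List[str]:
    """Render example Redshift SQL for a hypothetical external table on S3.

    Schema, database, table, and column names are illustrative assumptions.
    """
    # 1) Map a Glue Data Catalog database into Redshift as an external schema.
    create_schema = "\n".join([
        "CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum",
        "FROM DATA CATALOG DATABASE 'lakedb'",
        f"IAM_ROLE '{iam_role_arn}'",
        "CREATE EXTERNAL DATABASE IF NOT EXISTS;",
    ])
    # 2) Describe Parquet files on S3 as an external table; no data is loaded.
    create_table = "\n".join([
        "CREATE EXTERNAL TABLE spectrum.clicks (",
        "    customer_id BIGINT,",
        "    url VARCHAR(2048)",
        ")",
        "PARTITIONED BY (event_date DATE)",
        "STORED AS PARQUET",
        f"LOCATION '{s3_location}';",
    ])
    # 3) Join the external (S3) table with a local Redshift table.
    query = "\n".join([
        "SELECT c.segment, COUNT(*) AS clicks",
        "FROM spectrum.clicks AS e",
        "JOIN public.customers AS c ON c.customer_id = e.customer_id",
        "WHERE e.event_date >= '2020-01-01'",
        "GROUP BY c.segment;",
    ])
    return [create_schema, create_table, query]


if __name__ == "__main__":
    for stmt in spectrum_setup_statements(
        "arn:aws:iam::123456789012:role/example-spectrum-role",  # placeholder
        "s3://example-bucket/clicks/",  # placeholder
    ):
        print(stmt, end="\n\n")
```

With a partitioned external table like this one, the partitions still need to be registered in the catalog (for example with `ALTER TABLE ... ADD PARTITION` or a Glue crawler) before queries return rows.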
Amazon Redshift Federated Query
• Queries on RDS and Aurora PostgreSQL databases
• Analytics on live data without data movement
• Unified analytics across data warehouse, data lake & operational databases
• Flexible and easy way to ingest data
• Performant and secure access to data
Diagram: JDBC/ODBC, Amazon Redshift, RDS PostgreSQL, Aurora PostgreSQL, S3 data lake
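As a rough illustration of how a federated source is exposed to Redshift, the sketch below renders a `CREATE EXTERNAL SCHEMA ... FROM POSTGRES` statement. The schema name, database name, endpoint, IAM role, and secret ARN are all hypothetical placeholders. The function only produces the SQL text, so treat it as a sketch of the statement's shape, not a tested deployment recipe.

```python
def federated_schema_sql(
    local_schema: str,
    pg_database: str,
    pg_schema: str,
    endpoint: str,
    iam_role_arn: str,
    secret_arn: str,
    port: int = 5432,
) -> str:
    """Render DDL that exposes a PostgreSQL schema (RDS or Aurora) in Redshift.

    All argument values used below are illustrative assumptions.
    """
    return "\n".join([
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {local_schema}",
        "FROM POSTGRES",
        f"DATABASE '{pg_database}' SCHEMA '{pg_schema}'",
        f"URI '{endpoint}' PORT {port}",
        f"IAM_ROLE '{iam_role_arn}'",
        # Credentials for the PostgreSQL user are read from AWS Secrets Manager.
        f"SECRET_ARN '{secret_arn}';",
    ])


if __name__ == "__main__":
    print(federated_schema_sql(
        local_schema="apg",
        pg_database="orders_db",
        pg_schema="public",
        endpoint="example-cluster.cluster-abc123.eu-central-1.rds.amazonaws.com",
        iam_role_arn="arn:aws:iam::123456789012:role/example-federated-role",
        secret_arn="arn:aws:secretsmanager:eu-central-1:123456789012:secret:example",
    ))
```

Once such a schema exists, a query like `SELECT COUNT(*) FROM apg.orders;` (table name hypothetical) reads the live PostgreSQL data without copying it into the warehouse.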
Data lake export: share data in Parquet format
Parquet is an efficient open columnar storage format for analytics
Analyze your data with Redshift Spectrum and other AWS services such as Amazon Athena and Amazon EMR
Diagram: Amazon Redshift, Amazon S3, AWS Glue, Amazon Athena, Amazon EMR
UNLOAD ('select * from lineitem')
TO 's3://mybucket/unload/lineitem/'
IAM_ROLE '<iam-role-arn>'  -- placeholder: UNLOAD requires an authorization clause
FORMAT AS PARQUET
PARTITION BY (l_shipdate);
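The `PARTITION BY (l_shipdate)` clause writes the output under Hive-style `column=value/` prefixes, which is what lets Athena, EMR, and Redshift Spectrum read only the partitions a query needs. The sketch below models that layout in plain Python. The function name and sample rows are made up for illustration, and actual file names inside each prefix are not modeled.

```python
from collections import defaultdict
from typing import Dict, List


def partitioned_layout(
    rows: List[dict], base_uri: str, partition_col: str
) -> Dict[str, List[dict]]:
    """Group rows by the S3 prefix a partitioned unload would place them under."""
    layout = defaultdict(list)
    for row in rows:
        # Hive-style partition folder: <base>/<column>=<value>/
        prefix = f"{base_uri.rstrip('/')}/{partition_col}={row[partition_col]}/"
        # By default the partition column lives in the path, not in the data files.
        data = {k: v for k, v in row.items() if k != partition_col}
        layout[prefix].append(data)
    return dict(layout)


if __name__ == "__main__":
    sample = [  # hypothetical lineitem rows
        {"l_orderkey": 1, "l_shipdate": "1996-03-13"},
        {"l_orderkey": 2, "l_shipdate": "1996-03-13"},
        {"l_orderkey": 3, "l_shipdate": "1997-01-28"},
    ]
    for prefix, data in partitioned_layout(
        sample, "s3://mybucket/unload/lineitem/", "l_shipdate"
    ).items():
        print(prefix, data)
```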
Life of a query
1. Query: SELECT COUNT(*) FROM S3.EXT_TABLE GROUP BY…
2. Query is optimized and compiled using ML at the leader node. Determine what gets run locally and what goes to Amazon Redshift Spectrum.
3. Query plan is sent to all compute nodes.
4. Compute nodes obtain partition info from Data Catalog; dynamically prune partitions.
5. Each compute node issues multiple requests to the Amazon Redshift Spectrum layer.
6. Amazon Redshift Spectrum nodes scan your S3 data.
7. Amazon Redshift Spectrum projects, filters, joins and aggregates.
8. Final aggregations and joins with local Amazon Redshift tables done in-cluster.
9. Result is sent back to client.
Diagram: SQL Clients / BI Tools (JDBC/ODBC), Leader Node, Compute Nodes (10 GigE, HPC), Redshift Spectrum Fleet, Data Catalog (Glue / Apache Hive Metastore), Amazon S3 (exabyte-scale object storage)
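The partition pruning, Spectrum-side aggregation, and in-cluster final aggregation described on this slide can be mimicked with a toy model. This is a conceptual sketch only: the function names, the in-memory "partitions", and the sample data are assumptions for illustration and say nothing about how Redshift is implemented internally.

```python
from collections import Counter
from typing import Callable, Dict, List


def prune_partitions(
    partitions: Dict[str, List[dict]], keep: Callable[[str], bool]
) -> Dict[str, List[dict]]:
    """Drop partitions whose key fails the predicate, so they are never scanned."""
    return {key: rows for key, rows in partitions.items() if keep(key)}


def scan_and_aggregate(rows: List[dict], group_key: str) -> Counter:
    """Stand-in for the Spectrum layer: scan one partition, return partial counts."""
    return Counter(row[group_key] for row in rows)


def run_query(
    partitions: Dict[str, List[dict]], keep: Callable[[str], bool], group_key: str
) -> Dict[str, int]:
    """Stand-in for the cluster: merge partial counts into the final result."""
    total = Counter()
    for rows in prune_partitions(partitions, keep).values():
        total.update(scan_and_aggregate(rows, group_key))
    return dict(total)


if __name__ == "__main__":
    # Hypothetical data lake, partitioned by date.
    lake = {
        "2019-12-31": [{"country": "FR"}],
        "2020-01-01": [{"country": "DE"}, {"country": "US"}],
        "2020-01-02": [{"country": "DE"}],
    }
    # Equivalent in spirit to: SELECT country, COUNT(*) ... WHERE dt >= '2020-01-01' GROUP BY country
    print(run_query(lake, lambda dt: dt >= "2020-01-01", "country"))
```

The point of the split is that only partial aggregates, not raw rows, travel from the scan layer back to the cluster, and pruned partitions are never read at all.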
https://github.com/data-science-on-aws/workshop
Data Lake Query Services: How to choose?
Amazon Redshift
• Fastest query performance
• Complex SQL queries with multiple joins and sub-queries
• Querying an S3 data lake & joins between S3 data and local cluster data
Amazon EMR
• Simple & cost-effective way to run Hadoop, Spark, & Presto
• Run custom applications and code
• Define specific compute, memory, storage, and application parameters to optimize for your analytic requirements
Amazon Athena
• Run data exploration and discovery queries
• Analytical queries on data lakes, geospatial data, and service logs
• No need to set up or manage any servers
Getting Started
• O’Reilly Book - Data Science on AWS – Early Release Available!
https://datascienceonaws.com
• GitHub Repo
https://github.com/data-science-on-aws/workshop
• Get Started with AWS and Amazon Redshift
http://aws.amazon.com/free
https://docs.aws.amazon.com/redshift/
https://docs.aws.amazon.com/redshift/latest/dg/c-using-spectrum.html
Thank you!
Antje Barth
@anbarth
linkedin.com/in/antje-barth/
https://www.datascienceonaws.com/
