Data Con LA 2020
Description
In this session, I introduce the Amazon Redshift lake house architecture, which enables you to query data across your data warehouse, data lake, and operational databases to gain faster and deeper insights. With a lake house architecture, you can store data in open file formats in your Amazon S3 data lake.
Speaker
Antje Barth, Amazon Web Services, Sr. Developer Advocate, AI and Machine Learning
1. Owning Your Own
(Data) Lake House
Antje Barth
Sr. Developer Advocate AI/ML
Amazon Web Services
@anbarth
https://www.datascienceonaws.com/
2. About me
Antje Barth
Sr. Developer Advocate for AI and Machine Learning
Co-author of the O'Reilly book "Data Science on AWS"
Co-founder of the Düsseldorf chapter of Women in Big Data
https://www.datascienceonaws.com/
@anbarth
6. Data silos to data lakes
(Diagram: sources such as OLTP, ERP, CRM, and LOB systems feed one DW silo, while devices, web, sensors, and social data feed another, each serving only its own business intelligence. A data lake with open formats and a central catalog instead serves BI + analytics, data warehousing, and machine learning.)
Traditional data warehousing approaches don’t scale
7. Companies moving to data lake architectures
Bringing together the best of both worlds:
• Extends or evolves DW architectures
• Store any data in any format
• Durable, available, and exabyte scale
• Secure, compliant, auditable
• Run any type of analytics, from data warehousing to predictive
(Diagram: a data lake at the center, serving data warehousing, analytics, and machine learning)
8. The new data analytics stack
• Data movement: migration & streaming services
• Data lake infrastructure & management: infrastructure, data catalog & ETL, security & management
• Analytics: data warehousing, big data processing, interactive query, operational analytics, real-time analytics, serverless data processing
• Data, visualization, engagement, & machine learning: dashboards, predictive analytics, digital user engagement
9. The AWS data analytics portfolio
• Data movement: Database Migration Service | Snowball | Snowmobile | Kinesis Data Streams | Kinesis Data Firehose | Managed Streaming for Apache Kafka
• Data lake infrastructure & management: S3/Glacier, AWS Glue, Lake Formation
• Analytics: Redshift, EMR (Spark & Hadoop), Athena, Elasticsearch Service, Kinesis Data Analytics, AWS Glue (Spark & Python)
• Data, visualization, engagement, & machine learning: QuickSight, SageMaker, Comprehend, Lex, Polly, Rekognition, Translate, Pinpoint, Data Exchange
+ many more
10. Open standards, formats, and Apache open source
Flink
Ganglia
HBase
HCatalog
HDFS
Hive
Hudi
Java
JupyterHub
Kafka
Livy
Mahout
MapReduce
MXNet
MySQL
Oozie
ORC
Parquet
Phoenix
Pig
Presto
Python
PyTorch
R
Scala
Spark
Sqoop
SQL
TensorFlow
Tez
YARN
Zeppelin
Zookeeper
11. Data warehousing: Amazon Redshift
First and most popular cloud data warehouse
• Best performance, most scalable: 3x faster with RA3*, 10x faster with AQUA*; adds unlimited compute capacity on demand to meet unlimited concurrent access
• Lowest cost: cost-optimize workloads by paying for compute and storage separately; 1/10th the cost of a traditional DW at $1,000/TB/year; up to 75% less than other cloud data warehouses, with predictable costs
• Data lake & AWS integration: analyze exabytes of data across data warehouse, data lakes, and operational databases; query data across various analytics services
• Most secure & compliant: AWS-grade security (e.g., VPC, encryption with KMS, CloudTrail); all major certifications such as SOC, PCI DSS, ISO, FedRAMP, HIPAA
*vs. other cloud DWs
13. Redshift enables you to have a lake house approach
(Diagram: as companies move to data lake architectures, Redshift bridges the data warehouse (business data) and the data lake (event data).)
14. Redshift Lake House
• Have one foot in a data warehouse, and one foot in a data lake
• Store highly structured, frequently accessed data in Redshift
• Keep exabytes of structured and unstructured data in S3
• Query seamlessly across both to provide unique insights that you would not be able to obtain by querying independent datasets
Without moving or transforming data!
15. Redshift Lake House Architecture
Powered by the following capabilities:
• Amazon Redshift Spectrum: query open-format data directly in the Amazon S3 data lake without having to load the data or duplicate your infrastructure.
• Data Lake Export: save the results of an Amazon Redshift query directly to your S3 data lake in an open file format (Apache Parquet).
• Federated Query: enables Amazon Redshift to query data directly in Amazon RDS and Aurora PostgreSQL stores.
16. Redshift Spectrum
Redshift Spectrum is a feature of Redshift that allows Redshift SQL queries to reference external data on Amazon S3 as they would any other table in Amazon Redshift.
• Allows querying of potentially exabytes of data in an S3 data lake from within Amazon Redshift
• Data is queried in place, so no loading of data into your Redshift cluster is required
• Keeps your data warehouse lean by ingesting warm data locally while keeping other data in the data lake within reach
• Write query results from Redshift directly to S3 external tables
• Powered by a separate fleet of powerful Amazon Redshift Spectrum nodes
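In practice, registering S3 data with Spectrum takes two statements: an external schema pointing at a catalog, and an external table over the S3 location. A minimal sketch, where the schema, table, Glue database, bucket, and IAM role names are all illustrative assumptions:

```sql
-- Hypothetical setup: expose S3 data to Redshift via Spectrum.
-- The Glue database, role ARN, and bucket below are placeholders.
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'my_glue_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

CREATE EXTERNAL TABLE spectrum_schema.sales (
  sale_id   BIGINT,
  amount    DECIMAL(10,2),
  sale_date DATE
)
STORED AS PARQUET
LOCATION 's3://my-bucket/sales/';

-- The external table can then be queried like any local table:
SELECT COUNT(*) FROM spectrum_schema.sales;
```

No data is loaded into the cluster; the table definition only records where the Parquet files live and how to interpret them.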
17. Amazon Redshift Federated Query
Queries on RDS and Aurora PostgreSQL databases
• Analytics on live data without data movement
• Unified analytics across data warehouse, data lake & operational databases
• Flexible and easy way to ingest data
• Performant and secure access to data
(Diagram: SQL clients connect over JDBC/ODBC to Amazon Redshift, which queries RDS PostgreSQL, Aurora PostgreSQL, and the S3 data lake.)
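A federated source is attached the same way as a Spectrum schema, but pointing at a live PostgreSQL endpoint. A minimal sketch, where the database name, endpoint, role ARN, secret ARN, and table names are illustrative assumptions:

```sql
-- Hypothetical setup: attach a live Aurora PostgreSQL database as an
-- external schema. Endpoint, ARNs, and table names are placeholders.
CREATE EXTERNAL SCHEMA postgres_schema
FROM POSTGRES
DATABASE 'orders_db' SCHEMA 'public'
URI 'my-aurora-cluster.cluster-abc123.us-east-1.rds.amazonaws.com'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:123456789012:secret:pg-creds';

-- Join live operational rows against a local warehouse table:
SELECT o.order_id, c.segment
FROM postgres_schema.orders AS o
JOIN local_customers AS c ON o.customer_id = c.customer_id;
```

The credentials come from AWS Secrets Manager via the SECRET_ARN, so no passwords appear in the SQL itself.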
18. Data lake export: share data in Parquet format
Parquet is an efficient open columnar storage format for analytics. Analyze your data with Redshift Spectrum and other AWS services such as Amazon Athena and Amazon EMR.

UNLOAD ('SELECT * FROM lineitem')
TO 's3://mybucket/unload/lineitem/'
FORMAT AS PARQUET
PARTITION BY (l_shipdate);

(Diagram: Amazon Redshift unloads to Amazon S3, cataloged by AWS Glue and queried by Amazon Athena and Amazon EMR.)
19. Life of a query

SELECT COUNT(*)
FROM S3.EXT_TABLE
GROUP BY ...

1. The query arrives from SQL clients / BI tools over JDBC/ODBC at the leader node
2. The query is optimized and compiled using ML at the leader node, which determines what runs locally and what goes to Amazon Redshift Spectrum
3. The query plan is sent to all compute nodes (linked over a 10 GigE (HPC) network)
4. Compute nodes obtain partition info from the Data Catalog (Glue / Apache Hive Metastore) and dynamically prune partitions
5. Each compute node issues multiple requests to the Amazon Redshift Spectrum layer
6. Amazon Redshift Spectrum nodes scan your S3 data (exabyte-scale object storage)
7. Amazon Redshift Spectrum projects, filters, joins, and aggregates
8. Final aggregations and joins with local Amazon Redshift tables are done in-cluster
9. The result is sent back to the client
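The partition-pruning step above is driven by the query's predicates. A hypothetical query shape that exercises it, assuming an external lineitem table partitioned on l_shipdate (schema and table names are illustrative):

```sql
-- Hypothetical query: the filter on the partition column (l_shipdate)
-- lets compute nodes prune S3 partitions before the Spectrum layer
-- scans any data; the join and final aggregation run in-cluster.
SELECT l.l_shipdate, COUNT(*) AS shipments
FROM s3_ext_schema.lineitem AS l          -- external (S3) table
JOIN local_orders AS o                    -- local Redshift table
  ON l.l_orderkey = o.o_orderkey
WHERE l.l_shipdate >= '2020-01-01'        -- enables partition pruning
GROUP BY l.l_shipdate;
```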
21. Data lake query services: How to choose?
Amazon Redshift
• Fastest query performance
• Complex SQL queries with multiple joins and sub-queries
• Querying an S3 data lake & joins between S3 data and local cluster data
Amazon EMR
• Simple & cost-effective way to run Hadoop, Spark, & Presto
• Run custom applications and code
• Define specific compute, memory, storage, and application parameters to optimize for your analytic requirements
Amazon Athena
• Run data exploration and discovery queries
• Analytical queries on data lakes, geospatial data, and service logs
• No need to set up or manage any servers
22. Getting Started
• O’Reilly Book - Data Science on AWS – Early Release Available!
https://datascienceonaws.com
• GitHub Repo
https://github.com/data-science-on-aws/workshop
• Get Started with AWS and Amazon Redshift
http://aws.amazon.com/free
https://docs.aws.amazon.com/redshift/
https://docs.aws.amazon.com/redshift/latest/dg/c-using-spectrum.html