The Marriage of the Data Lake and the Data Warehouse and Why You Need Both

1
todd.reichmuth@snowflake.com Solution Engineering
Data Marriage of the Lake and Warehouse
(why you need both)

2
Data Lake
a place to store ALL (not only structured) data in native format that can inexpensively grow without limits and
can exist until the end of time
Data Warehouse
a place to store “enhanced” data that is integrated, governed, cleansed, and well defined. Typically the data is
multi-dimensional, historical, and non-volatile.
Data Mart
a subset of the data warehouse, oriented to a specific business line or team.
Big Data
Data characterized by the 3 (sometimes up to 5) V’s.
Volume (high), Velocity (near real time), Variety (everything), Veracity (trustworthy), Value (actionable insights)
Usually satisfied by Data Lake, sometimes by Data Warehouse
Let’s agree on terminology
2

3
In the beginning, ”analytics” were run against operational systems. This led to single purpose
decision support systems. It became apparent that this was not sustainable. Siloed
environments caused multiple interpretations of the truth, duplicated effort, high cost, and slow
time to value.
An enterprise solution was needed.
• 1979 – Bill Inmon introduces the term Data Warehouse
• 1983 – Teradata introduces DBC/1012 specifically designed for decision support
• 1986 – Ralph Kimball founds Red Brick Systems
• 1988 – IBM publishes an article introducing the term “Business Data Warehouse”
• 1992 – Bill Inmon publishes a book ”Building the Data Warehouse”
• 1996 – Ralph Kimball publishes a book “The Data Warehouse Toolkit”
Evolution of the Data warehouse

4
Data Warehouse architecture in the early days
OLTP
System
ERP
CRM
Data
Warehouse
Flat
Files
ETL
OLAP
Reporting
Mining

5
Data warehouse architecture evolution
OLTP
System
ERP
CRM
Data
Warehouse
Flat
Files
ETL
Dashboarding
Reporting
Analytics
Data Science
External Users
Data
Warehouse
???
Data
Warehouse
???
Web &
Mobile
External
Sources
IoT
ELT
Stream
MobileXML / JSON
High Volume
Low Latency
Data velocity increasing (machine generated + IoT) and in new shapes and sizes. Silo’ing of
data came back in different ways. Less tolerance for long analytical cycles. The existing
data warehouse could not support this evolution.
Structured
High Performance

6
2010 – James Dixon
CTO Pentaho
“If you think of a
datamart as a store of
bottled water –
cleansed and packaged
and structured for easy
consumption – the data
lake is a large body of
water in a more natural
state. The contents of
the data lake stream in
from a source to fill the
lake, and various users
of the lake can come to
examine, dive in, or
take samples.”
Emergence of the Data lake

7
Big Data V’s: Warehouse & Lake comparison
7
Data warehouse Data lake
Volume
Expensive for large data,
capacity is often capped
Low cost and high capacity
Variety
Structured
Schema on write
Structured, semi-structured,
unstructured
Schema on read
Velocity
Batch ETL/ELT
Hourly or daily latency
Streaming
Near real time
Veracity
Highly curated, governed.
Source of truth
Raw, uncleansed data.
Source of fact
Value
Measure things, detect trends
Q: What has worked well?
Find relationships, make
predictions
Q: What will work well?

8
Absolutely not (in most cases) !!!
They each fulfill a different purpose. Each is optimized for different users,
skills, workloads, and service level agreements.
The data lake is an “exploratory” layer with low data latency and no
preconceived boundaries on how data will be used. Perfect for data scientists
and those willing to do their own data preparation and discovery.
A data warehouse is a trusted and curated layer for making “bet your career”
decisions. Loved by business professionals. Built for easy consumption and
ready to publish.
Can the data lake eliminate the warehouse?

9
• Certain users need/want data in a way the warehouse can best provide
• Integrated
• Subject oriented
• Curated
• Time variant
• Relational
• Ready to consume
• Data may need to be enhanced to support particular user needs
• Some business functions will require high performance, high concurrency
• Enterprise level historical data needs to be carefully managed
Why a Lake should not replace a Warehouse

10
Modern data management landscape
OLTP
System
ERP
CRM
Data
Warehouse
Flat
Files
ETL
Dashboarding
Reporting
Analytics
Data Science
External Users
Data Lake
Web &
Mobile
External
Sources
IoT
ELT
Stream
Mobile
Fill the lake with all data and curate to the warehouse to satisfy known requirements

11
Building the modern data management environment
Start filling the lake
Low cost pure
capture
environment
Built separate from
core IT
Data is stored raw
Stage 1
Landing zone for
raw data
Stage 2
Data science
environment
Stage 3
Offload for data
warehouse
Stage 4
Key component of
data operations
Actively used as
platform for
experimentation
Test and learn
environment
Data scientist given
access
Governance = “just
enough”
Large datasets, low
intensity workload
moved to lake
Needle in the
haystack searches
that do not require
indexes
Lake becomes a
source for EDW
Lake becomes core
part of data
management
Attempt to enforce
strong governance
and tight security

12
Data Warehouse
• SMP or MPP
• Popularity of the appliance
• In some cases Hadoop
• Oracle, Teradata, Netezza, Vertica, Redshift,
etc.
Data Lake
• File system with abstraction
• Hadoop, Amazon S3, Azure Data Lake Storage,
Google Cloud Storage
Technology tools of choice
1
Can a single technology platform fulfill the
needs of both data management frameworks?

13
Lake
• Cheap and plentiful storage
• Ability to handle all data types
• Optimized for ingestion
Warehouse
• Supports commonly used tools
• Handles high concurrency
• Reliable, Available, Stable
• Data protection (secure & recoverable)
Desirable capabilities
• Low administration
• Support for multi-tenant/concurrency
• Easily scale up/down, in/out
• Low total cost of ownership
• No paying for unused capacity
• Sharing of data with external entities
Technical requirements to satisfy both frameworks

1414
Modern data management architecture
SMP Shared-disk MPP Shared-nothing
Single Server
Resizing cluster requires redistributing data.
Impact to user.
Reads/Writes at the same time
cripples the system
Each cluster requires its own copy of data
(ex: test/dev, HA)
Replication requires
additional hardware
Processes needed to maintain
sort and distribution for performance
Resizing cluster requires redistributing data.
Impact to user.
Traditional architectures
© 2018 Snowflake Computing Inc. All Rights Reserved.
Cloud based, Multi-cluster,
shared data
Centralized, elastic storage
Multiple, Independent compute clusters
Unlimited storage and compute, scaled
independently gives flexibility

15
Cloud-based architecture for big data analytics
Data Warehousing Cloud Service
Separated Storage, Compute, &
Metadata, Global Services
One Compute cluster,
multiple Databases
One Database, multiple
Compute clusters
Compute clusters
scale independently
Multi-Cluster compute farms
support high concurrency
Workloads don't compete
with each other
Compute
cluster
Compute
cluster
Compute
cluster
Compute
cluster
Compute
cluster
S
S
M
XXLL
Database
Compute
cluster
Sales
Users
Data
Scientist
Finance
Users
ETL &
ELT
Services
Test/Dev
Users
Marketing
Users
XL
s
Metadata
Azure Data
Factory

1616
Relational database extended to semi-structured data
> SELECT … FROM …Semi-structured data
(JSON, Avro, XML, Parquet, ORC)
Structured data
(e.g., CSV, TSV, …)
Storage optimization
Transparent discovery and storage optimization of
repeated elements
Query optimization
Full database optimization for queries
on semi-structured data
+

1717© 2018 Snowflake Computing Inc. All Rights Reserved.
Streaming and Batch Loading
Data Sources
Web Servers
Cloud Storage
On Premises
Data Lake
BI Tools
Apps
Source Raw Data
Augmented Data
Snowflake
Database
S3
S3
Transformation
Transformation
Kafka/Kinesis
COPY
PUT + COPY
Snowpipe
REST API
Azure
Blob
GCS
Azure
Blob

18
A better way to share data
Data Providers Data Consumers
No data movement
Share with unlimited number
of consumers Live access
Data consumers immediately see
all updates
Ready to use
Consumers can immediately
start querying

1919
Near Zero Administration
NO Infrastructure
NO Tuning
NO Upgrades/Patches
NO Indexing
NO Storage worries
NO Reorg/Vacuuming
NO Partitioning
NO Stats gathering
NO Workload mgmt.
NO Manual backups

2020
Comprehensive data protection
Protection against infrastructure failures
All data transparently & synchronously replicated 3+ ways
across multiple datacenters
Long-term data protection
Zero-copy clones + optional export to S3 enable
user-managed data copies
Protection against corruption & user errors
“Time travel” feature enables instant roll-back to any point in time
during chosen retention window
New data Modified data
T0 T1 T2
SELECT * FROM T0…
Daily
Weekly

2121© 2018 Snowflake Computing Inc. All Rights Reserved.
“Time travel” for data
Previous versions of data
automatically retained
Retention period selected by customer
Accessed via SQL extensions
AS OF for selection
CLONE to recreate
UNDROP recovers from accidental deletion
New data Modified data
T0 T1 T2
> SELECT * FROM mytable AS OF T0
> SELECT * FROM mytable AS OF T1

2222
Zero-copy data cloning
Instant data cloning operations
Databases, schema, tables, etc
Metadata-only operation
Modified data stored as new blocks
Unmodified data stored only once
No data copying required, no cost!
Instant test/dev environments
Test code on your entire production dataset
Swap tables into production when ready
CLONE 1
MODIFIED DATA

23
Data Science
Reporting
Ad-hoc analytics
ETL and
Processing
Pay for what you actually use…down to the second
Morning Noon Night
Workload
Morning Noon Night
Workload
Morning Noon Night
Workload
Morning Noon Night
Workload
S 2S
Autoscaling Multi-cluster Warehouse
L
Autosuspend/Autoresume
M
Autosuspend/Autoresume
M 2M
Autoscaling Multi-cluster Warehouse

2424
Access control
Role-based access
control model
Granular privileges on
all objects & actions
Authentication
Embedded
multi-factor authentication
Federated
authentication available
Data encryption
All data encrypted,
always, end-to-end
Encryption keys
managed automatically
External validation
Certified against
enterprise-class requirements
Protected by industrial-strength security

25
Big Data V’s: Warehouse & Lake comparison
2
Data warehouse Data lake
Volume
Expensive for large data,
capacity is often capped
Low cost and high capacity
Variety
Structured
Schema on write
Structured, semi-structured,
unstructured
Schema on read
Velocity
Batch ETL/ELT
Hourly or daily latency
Streaming
Near real time
Veracity
Highly curated, governed.
Source of truth
Raw, uncleansed data.
Source of fact
Value
Measure things, detect trends
Q: What has worked well?
Find relationships, make
predictions
Q: What will work well?

26
Data Management Modernization using the Cloud
Modern Data
Sources
Streaming
Data
Platforms
Compute for Data
Scientists
Compute for Ad
Hoc Users
Compute for BI
& Analytics
Data lake
Data warehouse
Native JDBC /
ODBC
Connector
Web UI
SQL Analysts
Data science
users and tools
Native JDBC /
ODBC
Connector
BI & analytics tools
Compute for Data
Loading
Compute for Data
Transformations
Blob
Extract, Load & Transform
Tools (ELT)
Extract, Transform & Load
Tools (ETL)
Traditional
Data Sources
Key
NO ETL
ETL / ELT
Relational
Databases
Database Replication
Compute for loading
Transformed Data

Thank you!

The Marriage of the Data Lake and the Data Warehouse and Why You Need Both

More Related Content

The Marriage of the Data Lake and the Data Warehouse and Why You Need Both