A Data Architects’ Guide to Building a Data Lake for Success:
THREE DATA LAKE BLUEPRINTS
Data lakes are a concept, not a technology.
– Matt Aslett, Sr. Research Director, 451 Research

Contents

Introduction: Data Lake Blueprints

Data Lake Blueprint 1: File System (HDFS, Amazon S3, Azure Data Lake Store, Google Cloud Storage)

Data Lake Blueprint 2: Streaming Pipeline

Data Lake Blueprint 3: Scale Out Database

Summary: Blueprint Comparison

About HVR

HVR Live Demo

An Introduction to Data Lake Blueprints
Over the past four years, we’ve been involved in a number of data lake projects. As a result of working with customers on this
common use case, we have learned a lot about what makes a data lake project successful. We have experience integrating data
into the different types of technologies that work well as a data lake, and understand the nuances of data integration for each.
As such, we see the following as the three common blueprints for data lakes: file systems, streaming pipelines, and scale out
databases.

In this e-book, we describe each type of data lake, along with its pluses and minuses, to help you determine the best option for your project.

What is a Data Lake?


A data lake is a method of organizing large volumes of highly diverse data. Data is
typically stored in its raw, original format, and changes are made available as soon as
possible to enable quick data exploration.

Most data lakes store data from a variety of sources. Log files, sensors, and social
media are “additive” sources, which means they create only new data points, such
as log entries, device measurements, and social media posts. Traditional relational
database applications, which store mission-critical data, are usually non-additive
sources. While some relational systems, such as those that store manufacturing
measurements, are additive, most also modify entries or delete rows. HVR’s log-based
Change Data Capture is typically used to capture changes from relational database
applications.

The Blueprints

FILE SYSTEMS
The lower-cost, more popular option for a data lake is a scalable file system such as the Hadoop Distributed File System (HDFS), Amazon S3, Azure Data Lake Store (ADLS), or Google Cloud Storage (GCS). Users began experimenting with file system based data lakes in 2015, and interest has been high since 2016.

STREAMING PIPELINE WITH KAFKA AND KINESIS
Today, we increasingly see organizations exploring data lakes for which the initial destination is a streaming pipeline, with short-term storage in a messaging platform like Apache Kafka, Amazon Kinesis, or Azure Event Hub, and long-term storage in another data store, such as S3 or HDFS, or even a relational database.

SCALE OUT DATABASES
HVR first began working with data lakes in 2014, using a scale out relational database designed to handle exponential data growth. Customer benefits from this platform include direct SQL access and fast performance. However, high overall cost limits the number of inbound data streams and the maximum amount of data a scale out database can realistically store. Snowflake has been bucking this trend in recent years, providing a cost-effective scale out database for both structured and unstructured data, with access through SQL. As a result, we’re seeing more and more customers adopting Snowflake for their data lakes.

Data Lake Blueprint 1: File System (HDFS, S3, ADLS, GCS)
Today’s most common data lake deployment strategy uses a file system such as HDFS, Amazon S3, ADLS, or GCS as the data
store. In this blueprint, HVR stores the transactions it captures as a sequence of changes on top of the initial data set.

Because a file system is the target and a traditional relational database is commonly at least one of the sources, transactional consistency is a major challenge. Of course, some use cases are more sensitive to transactional consistency than others. Imagine orphaned order detail information, or a financial transfer between two bank accounts where the customer record for the receiving account cannot be found. Transactional consistency and referential integrity are often taken for granted in traditional relational databases, but they cannot be ignored when pulling transactions apart and storing data in a file system.

[Diagram: relational database sources feed file system targets, with transactional consistency maintained between them.]

hvr-software.com | 7
Organizations using HVR to implement data lakes on file systems leverage several
unique capabilities to manage the data, including trail of changes, data publication,
file management, HIVE external tables, and big data compare.

Trail of Changes
Unlike a traditional relational database, data in a file system cannot be easily changed. Instead, you store an append-only set of incremental changes that use the row identifier to compute the current state of the row in the database. Inserts, updates and deletes from the source system are distinguished by their operation code, and a second extra column indicates uniquely the order in which changes were applied to the source. You can use HVR to create extra metadata columns and transform updates and deletes into new rows in destination files with just a few clicks.

[Figure: a row with key 1 is inserted (value “One”), updated (value “Two”), and then deleted on the source; in the file system the trail is stored as three appended rows tagged with their operation code and time metadata (Insert at T1, Update at T2, Delete at T3), and the most recent change determines the final state.]
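To make the mechanics concrete, here is a minimal Python sketch (illustrative only, not HVR code) that replays an append-only change trail and computes the current state of each row. The column names op, seq, and key are hypothetical stand-ins for the operation code and ordering metadata described above.

```python
# Illustrative only: replay an append-only change trail to derive current row state.
# Field names (op, seq, key) are hypothetical stand-ins for HVR's metadata columns.
from typing import Dict, Iterable

def replay_trail(changes: Iterable[dict]) -> Dict[str, dict]:
    """Apply inserts/updates/deletes in source order and return the final rows."""
    state: Dict[str, dict] = {}
    # Sort by the ordering column so changes are applied as they happened on the source.
    for change in sorted(changes, key=lambda c: c["seq"]):
        key = change["key"]
        if change["op"] in ("insert", "update"):
            state[key] = change["row"]          # latest image wins
        elif change["op"] == "delete":
            state.pop(key, None)                # deleted rows disappear from the final state
    return state

trail = [
    {"op": "insert", "seq": "T1", "key": "1", "row": {"value": "One"}},
    {"op": "update", "seq": "T2", "key": "1", "row": {"value": "Two"}},
    {"op": "delete", "seq": "T3", "key": "1", "row": {"value": "Two"}},
]
print(replay_trail(trail))   # {} -- the row was ultimately deleted
```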

Data Publication
Log-based change data capture on the source system(s) is aware of the transaction
boundaries. HVR’s manifest agent plugin allows users to publish the data at the end
of an integration cycle at a transactionally consistent point in time. By default, the
manifest is written as a JSON file and includes metadata such as the file name(s) and
the number of rows in the file(s). Organizations build solutions around the availability
of data at a consistent point in time based on the published data in the JSON format.
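As an illustration of how a downstream job might consume such a manifest, the sketch below waits for the manifest file and only then processes the data files it lists. The manifest field names (files, row_counts) and the path are hypothetical; the actual layout produced by the manifest agent plugin may differ.

```python
# Illustrative consumer of a publication manifest (field names and path are hypothetical).
import json
import time
from pathlib import Path

MANIFEST = Path("/datalake/orders/manifest_latest.json")  # example path

def wait_for_manifest(path: Path, poll_seconds: int = 30) -> dict:
    """Block until the manifest appears, signalling a transactionally consistent point."""
    while not path.exists():
        time.sleep(poll_seconds)
    return json.loads(path.read_text())

manifest = wait_for_manifest(MANIFEST)
for data_file, rows in zip(manifest["files"], manifest["row_counts"]):
    # Only process files that the manifest declares complete.
    print(f"processing {data_file} ({rows} rows)")
```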

hvr-software.com | 8
File Management
When running continuous integration into a data lake, file management can become a burden due to the sheer number of files being created. HVR can automatically create folders per table. For each table, you can also configure time slices to further break down the file grouping; for example, depending on the source system commit time, you might create files for every minute, every hour, or every day. You can also use data values in the folder structure to implement a partitioning strategy.

[Figure: source operations Insert (key 1, value “One”), Update (value “Two”), and Delete leave the row with no final state.]
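A minimal sketch of the kind of folder layout this produces, assuming a folder per table, an hourly time slice based on the commit time, and a data value (country) used as an extra partition key; the exact pattern HVR renders is configurable, and the names here are illustrative.

```python
# Illustrative folder layout: one folder per table, sliced by commit hour,
# with a data value (country) used as an extra partition key.
from datetime import datetime

def target_path(table: str, commit_time: datetime, country: str) -> str:
    time_slice = commit_time.strftime("%Y/%m/%d/%H")        # hourly slices
    return f"/datalake/{table}/{time_slice}/country={country}/"

print(target_path("orders", datetime(2021, 6, 1, 14, 35), "NL"))
# /datalake/orders/2021/06/01/14/country=NL/
```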

HIVE External Tables


HIVE external tables provide an easy way to access data stored on HDFS or in S3.
HVR’s support for HIVE table creation allows you to create the table definition using
compatible data types and automatically propagate DDL changes. Data scientists or
data analysts can now use their favorite Business Intelligence tool to perform data
discovery.
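For readers unfamiliar with HIVE external tables, the sketch below defines one over a folder of change files and queries it. The table layout, S3 location, and the use of the PyHive client are assumptions for illustration; HVR can generate equivalent DDL for you.

```python
# Illustrative: define a HIVE external table over change files and query it.
# Table name, columns, and location are hypothetical.
from pyhive import hive  # assumes a reachable HiveServer2 instance

DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS orders_trail (
    op        STRING,
    seq       STRING,
    order_id  BIGINT,
    amount    DECIMAL(10,2)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3a://example-bucket/datalake/orders/'
"""

conn = hive.connect(host="hive-server.example.com", port=10000)
cursor = conn.cursor()
cursor.execute(DDL)
cursor.execute("SELECT COUNT(*) FROM orders_trail")
print(cursor.fetchone())
```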

Big Data Compare


Data lake users are often concerned about the validity of the data. Is the data in files
on the file system a correct representation of the data in tables and columns on the
source system? HVR’s big data compare function uses HIVE external tables to
address this concern by allowing users to compare data in relational tables on the
source to its representation on the file system. HVR leverages the parallel power of
Hadoop to compute a current representation of the data by merging all changes with
the initial data set.
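The underlying idea can be illustrated with a query that merges the trail of changes into a current image and compares its row count with the source table. This is a simplified stand-in for HVR’s big data compare, and all object names are hypothetical.

```python
# Simplified illustration of a compare: reconstruct the current image from the
# trail with a window function, then compare row counts against the source.
from pyhive import hive

CURRENT_IMAGE_COUNT = """
SELECT COUNT(*) FROM (
    SELECT order_id, op,
           ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY seq DESC) AS rn
    FROM orders_trail
) latest
WHERE rn = 1 AND op <> 'delete'
"""

cursor = hive.connect(host="hive-server.example.com", port=10000).cursor()
cursor.execute(CURRENT_IMAGE_COUNT)
lake_count = cursor.fetchone()[0]

source_count = 1250000   # in practice, queried from the source database
print("in sync" if lake_count == source_count else "row counts differ")
```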

Data Lake Blueprint 2 – Streaming Pipeline
Rapidly gaining in popularity, the second data lake blueprint uses a streaming pipeline, such as Apache Kafka, Amazon Kinesis, or Azure Event Hub, as a data stream processing platform to initially store data, perform the first set of analyses, and distribute data
to its final destination(s). This blueprint has many similarities to the first (file system) blueprint. Like a file system, the streaming
pipeline expects a stream of changes rather than direct updates or deletes to the data.

Trail of Changes
Once a message is pushed into a Kafka topic, Kinesis stream, or Azure Event Hub,
the message cannot change and anyone may consume it. To track row changes in a source system, a sequence of changes must be pushed for each row, keyed on the row identifier. Extra metadata indicates whether the most recent operation was an insert,
an update or a delete, and uniquely indicates the order in which the change was
processed on the source system.

JSON Messages
The self-descriptive JSON format is the dominant choice for organizations looking
to build their data lake using Kafka. The jury is still out on whether updates should
include before-row images or only the after-row information if a primary key is
available and is unchanged. If the identifying key changes, you must include the
before image.

Avro is a distant second choice of message format.
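As a purely illustrative example (not HVR’s wire format), a JSON change message for an update might carry the operation, ordering metadata, and before/after row images, as in the Python sketch below; all field names are hypothetical.

```python
# Illustrative JSON change message; field names are hypothetical, not HVR's format.
import json

message = {
    "op": "update",                       # insert | update | delete
    "source_seq": "000123:0456",          # uniquely orders changes as applied on the source
    "table": "SALES.ORDERS",
    "key": {"order_id": 42},
    "before": {"order_id": 42, "status": "OPEN"},    # needed if the identifying key changes
    "after":  {"order_id": 42, "status": "SHIPPED"},
}
payload = json.dumps(message).encode("utf-8")
print(payload)
```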

Topics/Streams per Micro Service or Table
Depending on the data source, a topic/stream may be created for each micro service application or for each schema + table. Topics/streams are automatically created when data arrives for a table. The benefit of a topic per micro service or per table is data processing simplicity. The disadvantage is that this implementation requires you to ensure consistency across multiple tables: you’ll need a publication mechanism to maintain transactional consistency between the streaming pipeline (i.e., the initial landing zone) and the data lake.
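A minimal sketch of the per-table pattern, assuming the kafka-python client and a schema.table topic-naming convention; the topic names and keying strategy are illustrative choices, not a prescribed HVR configuration.

```python
# Illustrative: publish each change to a topic named after its schema and table,
# keyed by row identifier so changes for the same row stay ordered within a partition.
import json
from kafka import KafkaProducer   # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="broker.example.com:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    key_serializer=lambda k: str(k).encode("utf-8"),
)

def publish(change: dict) -> None:
    topic = f"{change['schema']}.{change['table']}".lower()   # e.g. "sales.orders"
    producer.send(topic, key=change["key"], value=change)

publish({"schema": "SALES", "table": "ORDERS", "key": 42, "op": "update"})
producer.flush()
```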

Initial Load
As with any data lake, this blueprint requires an initial data set. The initial data set
may flow through the streaming pipeline. But if data volumes are large, you can opt to
perform the initial load directly to the ultimate target(s) (consumer(s)).

Data Lake Blueprint 3 – Scale Out Database
Organizations have been using relational database management systems for data lakes for several years. The use of this data lake blueprint was declining in favor of Blueprints 1 and 2 until relatively recently, when many organizations started successfully adopting the cloud-native Snowflake platform as their technology of choice for the data lake, reviving interest in this third blueprint.
This blueprint has a number of attractive attributes, particularly when an organization is first starting on its data lake journey.

Relational Database Attributes


The relational database has been around for a long time and has a number of well-understood attributes:

• Users can employ sophisticated SQL queries to access data, and off-the-shelf Business Intelligence tools are readily available to explore the data.

• Transactional behavior of the database is defined around the ACID concepts (Atomicity, Consistency, Isolation, Durability). This simplifies the data load strategy, since only committed data is visible to end users.

• The concept of schemas, which determine the namespace of database objects, is used when the data initially lands in the data lake. This allows tables with the same name but different definitions (from different source systems) to reside in the same database unmodified.

Scale Out Database
As the number of source systems grows, organizations want to know they can manage their data volume today, tomorrow, and the next day. Whether the destination is an on-premises or cloud-based database, organizations typically use a scale out database such as Snowflake, Teradata, Greenplum, Redshift, Google BigQuery, or Azure Synapse Analytics (formerly Azure SQL Data Warehouse) to handle the exponential growth of their data lake.

Table Creation and Initial Data Load
As they implement their data lake, HVR customers generally take advantage of HVR’s built-in table creation and initial data load capabilities. The benefits include:

• Data types from source database tables are automatically mapped to compatible data types on the target database (including the right data length/precision). Table changes on source systems (DDL changes) can be automatically propagated to the destination database in a heterogeneous environment.

• Data moving between source and target is always compressed to limit network utilization. With the increasingly common use of data lakes in the cloud, data compression provides major performance advantages and can also deliver cost benefits for some cloud providers. Data can also be encrypted using SSL keys.

• A database-specific direct load method on the target system is used to get the data in as quickly as possible. For example, data is copied from (compressed) files into Snowflake, Teradata Parallel Transporter (TPT) is used for Teradata, gpfdist is leveraged when loading data into Greenplum, and Redshift uses copy from (compressed) files in S3 (see the sketch after this list).

The initial data load is synchronized with the continuous incremental data feed. This allows organizations to initially load data and then store incremental changes that occur subsequently, without losing changes or storing changes twice.
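As a hedged illustration of the direct-load idea for Snowflake, the sketch below copies compressed staged files into a target table using the Snowflake Python connector. The account, credentials, stage, and table names are placeholders, and HVR performs this loading for you automatically.

```python
# Illustrative direct load into Snowflake from compressed staged files.
# Connection details, stage, and table names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345.eu-west-1",
    user="LOADER",
    password="***",
    warehouse="LOAD_WH",
    database="LAKE",
    schema="RAW",
)
conn.cursor().execute("""
    COPY INTO RAW.ORDERS
    FROM @LAKE.RAW.ORDERS_STAGE
    FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP FIELD_OPTIONALLY_ENCLOSED_BY = '"')
""")
conn.close()
```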

Continuous Micro Batches
Because the use of SQL statements makes processing updates a straightforward
process, the database-based data lake generally duplicates the current state of the
source database—except for deletes. While a deleted row is usually marked in a
separate column, some use cases store a history of database changes, just as the file
system blueprint stores a trail of changes.

Most scale out databases are not built to handle the transaction volume of traditional relational databases, which can process thousands of transactions and tens of
thousands of row changes every second at peak times. Simply replicating the
workload, which can come from multiple busy source systems, onto the destination
scale out database would not work.

What does work is running micro batches that efficiently move data into staging
tables. Set-based SQL statements then perform multiple row changes in a single
statement. HVR uses this so-called burst mode to move the data into staging tables
while ensuring that the destination tables always reflect a transactionally consistent
state of the source system.
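To illustrate what a set-based apply from a staging table looks like, here is a simplified stand-in for the burst-mode idea (not HVR’s actual implementation), with hypothetical table and column names: a single MERGE statement absorbs an entire micro batch.

```python
# Simplified set-based apply of a micro batch from a staging table.
# One MERGE handles all inserts/updates/deletes in the batch; names are hypothetical.
MERGE_BATCH = """
MERGE INTO orders AS t
USING orders_staging AS s
   ON t.order_id = s.order_id
WHEN MATCHED AND s.op = 'delete' THEN DELETE
WHEN MATCHED THEN UPDATE SET t.status = s.status, t.amount = s.amount
WHEN NOT MATCHED AND s.op <> 'delete' THEN
     INSERT (order_id, status, amount) VALUES (s.order_id, s.status, s.amount)
"""

def apply_micro_batch(cursor) -> None:
    """Apply one staged micro batch, then clear the staging table, so readers only
    ever see a transactionally consistent state of the target tables."""
    cursor.execute(MERGE_BATCH)
    cursor.execute("DELETE FROM orders_staging")
    cursor.execute("COMMIT")
```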

Data Compare
End users must be confident in the data they explore in the data lake. The relational
attributes of this data lake facilitate programmatic comparisons with data in the source
(often relational) database. HVR’s compare function allows you to use filters against the
data to balance resource utilization, compare job duration, and improve confidence in
the data. To minimize the impact of the compare job on database performance, you can
schedule jobs to execute when there is least load on the systems.

Summary: Blueprint Comparison
Organizations begin new data lake projects every day. When they do, they must determine the best deployment option. While the
blueprints described in this paper are not the only options, this discussion will give you an overview of some of your choices. The
following table summarizes the high-level attributes for the three blueprints described here to help you get started.

| | BLUEPRINT 1: File System (HDFS, S3, ADLS, etc.) | BLUEPRINT 2: Streaming Pipeline (Kafka, Kinesis, Azure Event Hub) | BLUEPRINT 3: Scale Out Database |
| SQL Access | HIVE, limited complexity | KSQL, Kinesis Analytics | Yes |
| Store trail of changes | Yes | Yes | Optional |
| Perform updates/deletes | Not directly | No | Updates yes; deletes often mark rows as deleted |
| Data stored per table | Yes | Typically, or per micro service | Yes |
| Data publication required | For consistency, yes | Yes, if topics are per table and consistency matters | No |
| Data format | CSV, Avro, Parquet | JSON, Avro | Relational tables |
| Data compare available | Yes, through HIVE | Not currently | Yes |
| Cost per TB | Can be low | Can be low | Generally high |

About HVR
REAL-TIME DATA REPLICATION FOR THE ENTERPRISE
We accelerate data movement so that you can revolutionize your business. HVR is designed to move large volumes of data fast
and efficiently in complex environments for real-time reporting and analytics.

Our goal is to keep your data moving and in sync as you adopt new technologies for storing, streaming, and analyzing data.
Our scalable solution gives you everything you need for efficient data replication from beginning to end so that you can readily
revolutionize your business for the modern world.

• 1,500+ deployments of HVR . . . and counting
• The world’s largest digital industrial manufacturer uses HVR to consolidate 100+ database systems into a centralized data lake
• We have helped 150+ companies revolutionize their business and become real-time companies
• Customers in 30+ countries
• Our favorite challenge? Moving data faster and more efficiently as data volumes increase
• HQ: San Francisco, with offices in the United Kingdom, the Netherlands, and China
Sign Up For A Live Demo Today
hvr-software.com/weekly-webinar-demo/
