Compliments of
Architecting Data Lakes
Data Management Architectures
for Advanced Business Use Cases
Alice LaPlante
& Ben Sharma
The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Architecting Data Lakes, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.
While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-491-95257-3
Table of Contents
1. Overview
    What Is a Data Lake?
    Data Management and Governance in the Data Lake
    How to Deploy a Data Lake Management Platform
    Using a Bottom-Up Approach to Data Governance to Rank Data Sets
    Data Lakes in Different Industries
6. Looking Ahead
    Ground-to-Cloud Deployment Options
    Looking Beyond Hadoop: Logical Data Lakes
    Federated Queries
    Data Discovery Portals
    In Conclusion
    A Checklist for Success
CHAPTER 1
Overview
1 Gartner. Gartner Survey Highlights Challenges to Hadoop Adoption. May 13, 2015.
as a tactical rather than strategic tool, supplementing as opposed to replacing the EDW. That's because organizations question whether Hadoop can meet their enterprise service-level agreements (SLAs) for availability, scalability, performance, and security.
Until now, few companies have managed to recoup their investments in big data initiatives using Hadoop. Global organizational spending on big data exceeded $31 billion in 2013, and this is predicted to reach $114 billion in 2018.² Yet only 13 percent of these companies have achieved full-scale production for their big-data initiatives using Hadoop.
One major challenge with traditional EDWs is their schema-on-write architecture, the foundation for the underlying extract, transform, and load (ETL) process required to get data into the EDW. With schema-on-write, enterprises must design the data model and articulate the analytic frameworks before loading any data. In other words, they need to know ahead of time how they plan to use that data. This is very limiting.
In response, organizations are taking a middle ground. They are starting to extract and place data into a Hadoop-based repository without first transforming the data the way they would for a traditional EDW. After all, one of the chief advantages of Hadoop is that organizations can dip into the database for analysis as needed. All frameworks are created in an ad hoc manner, with little or no prep work required.
Driven both by enormous data volumes and by cost (Hadoop can be 10 to 100 times less expensive to deploy than traditional data warehouse technologies), enterprises are starting to defer the labor-intensive processes of cleaning up data and developing schemas until they've identified a clear business need.
In short, they are turning to data lakes.
built using Hadoop. The data can be structured or unstructured. You can then use a variety of storage and processing tools, typically tools in the extended Hadoop family, to extract value quickly and inform key organizational decisions.
Because all data is welcome, data lakes are an emerging and powerful approach to the challenges of data integration in a traditional EDW (Enterprise Data Warehouse), especially as organizations turn to mobile and cloud-based applications and the IoT.
Some of the benefits of a data lake include:
The kinds of data from which you can derive value are unlimited.
You can store all types of structured and unstructured data in a data lake, from CRM data to social media posts.
You don't have to have all the answers upfront.
Simply store raw data; you can refine it as your understanding and insight improve.
You have no limits on how you can query the data.
You can use a variety of tools to gain insight into what the data means.
You don't create any more silos.
You gain democratized access, with a single, unified view of data across the organization.
The differences between EDWs and data lakes are significant. An
EDW is fed data from a broad variety of enterprise applications.
Naturally, each application's data has its own schema. The data thus
needs to be transformed to conform to the EDWs own predefined
schema.
Designed to collect only data that is controlled for quality and conforming to an enterprise data model, the EDW is thus capable of answering a limited number of questions. However, it is eminently suitable for enterprise-wide use.
Data lakes, on the other hand, are fed information in its native form.
Little or no processing is performed for adapting the structure to an
enterprise schema. The structure of the data collected is therefore
not known when it is fed into the data lake, but only found through
discovery, when read.
required. Many organizations invest in standardization committees that meet and deliberate over standards, and can take months or even years to complete the task at hand.
These committees must do a lot of upfront definition work: first, they need to delineate the problem(s) they wish to solve. Then they must decide what questions they need to ask of the data to solve those problems. From that, they design a database schema capable of supporting those questions. Because it can be very difficult to bring in new sources of data once the schema has been finalized, the committee often spends a great deal of time deciding what information is to be included, and what should be left out. It is not uncommon for committees to be gridlocked on this particular issue for weeks or months.
With this approach, business analysts and data scientists cannot ask ad hoc questions of the data; they have to form hypotheses ahead of time, and then create the data structures and analytics to test those hypotheses. Unfortunately, the only analytics results are ones that the data has been designed to return. This issue doesn't matter so much if the original hypotheses are correct, but what if they aren't? You've created a closed-loop system that merely validates your assumptions, which is not good practice in a business environment that constantly shifts and surprises even the most experienced businesspeople.
The data lake eliminates all of these issues. Both structured and unstructured data can be ingested easily, without any data modeling or standardization. Structured data from conventional databases is placed into the rows of the data lake table in a largely automated process. Analysts choose which tags and tag groups to assign, typically drawn from the original tabular information. The same piece of data can be given multiple tags, and tags can be changed or added at any time. Because the schema for storing does not need to be defined up front, expensive and time-consuming modeling is not needed.
organizations can use data lakes to get better visibility into data, eliminate data silos, and capture 360-degree views of customers. With data lakes, organizations can finally unleash Big Data's potential across industries.
Easier accessibility
Data lakes also solve the challenge of data integration and accessibility that plagues EDWs. Using Big Data Hadoop infrastructures, you can bring together ever-larger data volumes for analytics, or simply store them for some as-yet-undetermined future use. Unlike a monolithic view of a single enterprise-wide data model, the data lake allows you to put off modeling until you actually use the data, which creates opportunities for better operational insights and data discovery.
Reduced costs
Because of economies of scale, some Hadoop users claim they pay less than $1,000 per terabyte for a Hadoop cluster. Although numbers can vary, business users understand that because it's no longer excessively costly for them to store all their data, they can maintain copies of everything by simply dumping it into Hadoop, to be discovered and analyzed later.
Scalability
Big Data is typically defined as the intersection between volume,
variety, and velocity. EDWs are notorious for not being able to scale
beyond a certain volume due to restrictions of the architecture. Data
processing takes so long that organizations are prevented from
exploiting all their data to its fullest extent. Using Hadoop, petabyte-
scale data lakes are both cost-efficient and relatively simple to build
and maintain at whatever scale is desired.
The data lake offers a middle ground. A Hadoop data lake is flexible, scalable, and cost-effective, but it can also possess the discipline of a traditional EDW. You must simply add data management and governance to the data lake.
Once you decide to take this approach, you have four options for
action.
Then we provide a data lake reference architecture designed by Zaloni to represent best practices in building a data lake. We'll also talk about the challenges that companies face in building and managing data lakes.
The most important chapters of the book discuss why an integrated approach to data lake management and governance is essential, and describe the sort of solution needed to effectively manage an enterprise-grade lake. The book also delves into best practices for consuming the data in a data lake. Finally, we take a look at what's ahead for data lakes.
Figure 2-1. Zaloni's data lake architecture
The main advantage of this architecture is that data can come into the data lake from anywhere, including online transaction processing (OLTP) or operational data store (ODS) systems, an EDW, logs or other machine data, or from cloud services. These source systems include many different formats, such as file data, database data, ETL, streaming data, and even data coming in through APIs.
The data is first loaded into a transient loading zone, where basic data quality checks are performed using MapReduce or Spark by leveraging the Hadoop cluster. Once the quality checks have been performed, the data is loaded into Hadoop in the raw data zone, and sensitive data can be redacted so it can be accessed without revealing personally identifiable information (PII), personal health information (PHI), payment card industry (PCI) information, or other kinds of sensitive or vulnerable data.
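The specifics vary by platform; as a rough, hypothetical PySpark sketch of the kind of transient-zone quality check and redaction described above (the paths, column names, and rules are assumptions, not part of Zaloni's reference architecture):

    # A minimal sketch of transient-zone checks, assuming a PySpark environment
    # and hypothetical paths and column names.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("transient-zone-checks").getOrCreate()

    # Load a hypothetical landing file from the transient loading zone.
    landing = spark.read.json("/lake/transient/members/2016-01-01/")

    # Basic quality check: reject records missing a primary identifier.
    valid = landing.filter(F.col("member_id").isNotNull())
    rejected = landing.filter(F.col("member_id").isNull())

    # Redact a sensitive field (for example, a social security number)
    # before the data is written to the raw zone.
    redacted = valid.withColumn("ssn", F.lit("***-**-****"))

    redacted.write.mode("overwrite").parquet("/lake/raw/members/2016-01-01/")
    rejected.write.mode("overwrite").json("/lake/quarantine/members/2016-01-01/")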
Data scientists and business analysts alike dip into this raw data zone for sets of data to discover. An organization can, if desired, perform standard data cleansing and data validation methods and place the data in the trusted zone. This trusted repository contains both master data and reference data.
Master data is the basic data sets that have been cleansed and validated. For example, a healthcare organization may have master data sets that contain basic member information (names, addresses) and members' additional attributes (dates of birth, social security numbers). An organization needs to ensure that this reference data kept in the trusted zone is up to date, using change data capture (CDC) mechanisms.
Data Ingestion
Organizations have a number of options when transferring data to a
Hadoop data lake. Managed ingestion gives you control over how
data is ingested, where it comes from, when it arrives, and where it
is stored in the data lake.
A key benefit of managed ingestion is that it gives IT the tools to
troubleshoot and diagnose ingestion issues before they become
problems. For example, with Zaloni's Data Lake Management Platform, Bedrock, all steps of the data ingestion pipeline are defined in
advance, tracked, and logged; the process is repeatable and scalable.
Bedrock also simplifies the onboarding of new data sets and can
ingest from files, databases, streaming data, REST APIs, and cloud
storage services like Amazon S3.
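Bedrock's own interfaces are not shown in this book; purely as a hypothetical sketch of what a managed-ingestion definition can capture (source, schedule, target zone, and logging), consider:

    # Hypothetical sketch of a managed-ingestion definition; this is not
    # Bedrock's actual configuration model.
    import logging
    from dataclasses import dataclass

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("ingestion")

    @dataclass
    class IngestionPipeline:
        name: str          # e.g., "member_files_daily"
        source: str        # file drop, database, stream, REST API, or S3 bucket
        schedule: str      # e.g., a cron expression like "0 2 * * *"
        target_zone: str   # e.g., "/lake/raw/members/"

        def run(self) -> None:
            # Each step is logged so failures can be diagnosed and the
            # pipeline restarted in a repeatable way.
            log.info("Starting pipeline %s from %s", self.name, self.source)
            # ... transfer, checksum verification, and metadata capture go here ...
            log.info("Pipeline %s landed data in %s", self.name, self.target_zone)

    pipeline = IngestionPipeline(
        name="member_files_daily",
        source="s3://example-bucket/members/",
        schedule="0 2 * * *",
        target_zone="/lake/raw/members/",
    )
    pipeline.run()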
When you are ingesting unstructured data, however, you realize the key benefit of a data lake for your business. Today, organizations consider unstructured data such as photographs, Twitter feeds, or blog posts to provide the biggest opportunities for deriving business value from the data being collected. But the limitations of the schema-on-write process of traditional EDWs mean that only a small part of this potentially valuable data is ever analyzed.
Data governance
An important part of the data lake architecture is to first put data in a transitional or staging area before moving it to the raw data repository. It is from this staging area that all possible data sources, external or internal, are either moved into Hadoop or discarded. As with
Zaloni Bedrock
A fully integrated data lake management platform that manages
ingestion, metadata, data quality and governance rules, and
operational workflows.
Data Processing
Processing is the stage in which data can be transformed into a standardized format by business users or data scientists. It's necessary because during the process of ingesting data into a data lake, the user does not make any decisions about transforming or standardizing the data. Instead, this is delayed until the user reads the data. At that point, the business users have a variety of tools with which to standardize or transform the data.
One of the biggest benefits of this methodology is that different business users can perform different standardizations and transformations depending on their unique needs. Unlike in a traditional EDW, users aren't limited to just one set of data standardizations and transformations that must be applied in the conventional schema-on-write approach.
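To make the schema-on-read idea concrete, here is a small, hypothetical PySpark example in which one team standardizes raw clickstream JSON at read time; another team could read the same raw files and apply entirely different transformations. The paths and field names are assumptions.

    # Hypothetical schema-on-read transformation; paths and fields are illustrative.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

    # The raw zone holds the data exactly as it was ingested.
    raw = spark.read.json("/lake/raw/clickstream/")

    # One team's standardization: normalize types and keep only the fields
    # it needs. Another team can read the same raw data differently.
    standardized = (
        raw.withColumn("event_time", F.col("event_time").cast("timestamp"))
           .withColumn("country", F.upper(F.col("country")))
           .select("user_id", "event_time", "country", "page")
    )

    standardized.write.mode("overwrite").parquet("/lake/trusted/clickstream/")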
Data Access
This stage is where the data is consumed from the data lake. There are various modes of accessing the data: queries, tool-based extractions, or extractions that need to happen through an API. Some applications need to source the data for performing analyses or other transformations downstream.
Visualization is an important part of this stage, where the data is
transformed into charts and graphics for easier comprehension and
consumption. Tableau and Qlik are two tools that can be employed
for effective visualization. Business users can also use dashboards,
either custom-built to fit their needs, or off-the-shelf Microsoft SQL
Server Reporting Services (SSRS), Oracle Business Intelligence
Enterprise Edition (OBIEE), or IBM Cognos.
Application access to the data is provided through APIs, message queues, and database access.
Here's an example of what your technology stack might look like at this stage:
Qlik
Allows you to create visualizations, dashboards, and apps that
answer important business questions.
Tableau
Business intelligence software that allows users to connect to
data, and create interactive and shareable dashboards for visual
ization.
Spotfire
Data visualization and analytics software that helps users
quickly uncover insights for better decision-making.
RESTful APIs
An API that uses HTTP requests to GET, PUT, POST, and
DELETE data.
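As a simple illustration of API-based access, the snippet below uses Python's requests library against a hypothetical data lake REST endpoint; the URL, resource names, and payload are assumptions, not part of any specific product.

    # Hypothetical example of application access over a RESTful API.
    import requests

    BASE_URL = "https://datalake.example.com/api/v1"  # hypothetical endpoint

    # GET: retrieve a refined data set for a downstream application.
    resp = requests.get(f"{BASE_URL}/datasets/customer_360", params={"limit": 100})
    resp.raise_for_status()
    records = resp.json()

    # POST: register a new derived data set produced by an analytics job.
    new_dataset = {"name": "churn_scores", "zone": "refined", "format": "parquet"}
    resp = requests.post(f"{BASE_URL}/datasets", json=new_dataset)
    resp.raise_for_status()

    # DELETE: remove a temporary extract that is no longer needed.
    requests.delete(f"{BASE_URL}/datasets/tmp_extract").raise_for_status()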
A Combined Approach
Some vendors also take a combined approach, utilizing benefits from the top-down and bottom-up processes. For example, some top-down process is essential if the data from the data lake is going to be a central part of the enterprise's overall data architecture. At the same time, much of the data lake can be managed from the bottom up, including managed data ingestion, data inventory, data enrichment, data quality, metadata management, data lineage, workflow, and self-service access.
Metadata
A solid governance strategy requires having the right metadata in
place. With accurate and descriptive metadata, you can set policies
and standards for managing and using data. For example, you can
create policies that enforce users' ability to acquire data from certain places; which users own and are therefore responsible for the data; which users can access the data; how the data can be used; and how it's protected, including how it is stored, archived, and backed up.
Your governance strategy must also specify how data will be audited
to ensure that you are complying with government regulations. This
can be tricky as diverse data sets are combined and transformed.
All this is possible if you deploy a robust data management platform
that provides the technical, operational, and business metadata that
third-party governance tools need to work effectively.
Rate of Change
The Hadoop ecosystem is large, complex, and constantly changing.
Keeping up with the developments in the open-source community
can be a full-time job in and of itself. Each of the components is
continually evolving, and new tools and solutions are constantly
emerging from the community. For an overview of the Hadoop ecosystem, check out The Hadoop Ecosystem Table on GitHub.
Technological Complexity
Finally, you've got the complexity of deploying the technology itself. You've got to pull together an ecosystem that encompasses hardware, software, and applications. As a distributed filesystem with a large and ever-changing ecosystem, Hadoop requires you to integrate a plethora of tools to build your data lake.
Ingestion
Ingestion is the process of moving data into the distributed Hadoop file system. Deploying a solution that can perform managed ingestion is critical, because it supports ingestion from streaming sources like log files, or physical files landed on an edge node outside Hadoop. Data quality checks are performed after data is moved to HDFS, so you can leverage the cluster resources and perform the data-quality checks in a distributed manner.
It's important to understand that all data in the lake is not equal. You
need governance rules that can be flexible, based on the type of data
that is being ingested. Some data should be certified as accurate and
of high quality. Other data might require less accuracy, and therefore
different governance rules.
The basic requirements when ingesting data into the data lake
include the following:
Lack of Visibility
Without the proper tools, you lack visibility and transparency into the data lake. Ideally, a solution will organize and catalog data as it arrives, and then provide simple user interfaces to search it. A solution might also create Hive tables for the data, which can be queried using SQL.
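For example, a hypothetical PySpark job might register ingested files as a Hive table so analysts can query them with SQL; the table and path names here are illustrative only.

    # Hypothetical example: expose ingested data as a SQL-queryable Hive table.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("catalog-hive-tables")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Register the raw files as a table in the Hive metastore.
    raw = spark.read.parquet("/lake/raw/members/")
    raw.write.mode("overwrite").saveAsTable("raw_members")

    # Analysts can now explore the data with plain SQL.
    spark.sql("SELECT state, COUNT(*) AS members FROM raw_members GROUP BY state").show()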
data swamp, organizations can deploy integrated management and
governance platforms that allow them to manage, automate, and
execute operational tasks in the data lake. This saves them both
development time and money.
Data Governance
It's important to note that in addition to the tools required to maintain governance, having a process, frequently a manual process, is also required. Process can be as simple as assigning stewards to new data sets, or forming a data lake enterprise data council to establish data definitions and standards.
Questions to ask when considering goals for data governance:
Quality and consistency
Is the data of sufficient quality and consistency to be useful to business users and data scientists in making important discoveries and decisions?
Policies and standards
What are the policies and standards for ingesting, transforming,
and using data, and are they observed uniformly throughout the
organization?
Security, privacy, and compliance
Is access to sensitive data limited to those with the proper
authorization?
Data lifecycle management
How will we manage the lifecycle of the data? At what point will we move it from expensive, Tier-1 storage to less expensive storage mechanisms?
Data Acquisition
Although you have many options when it comes to getting data into Hadoop, doing so in a managed way means that you have control over what data is ingested, where it comes from, when it arrives, and where in Hadoop it is stored. A well-managed data ingestion process simplifies the onboarding of new data sets and therefore the development of new use cases and applications.
As we discussed in Chapter 3, the first challenge is ingesting the data: getting the data into the data lake. An integrated data lake management platform will perform managed ingestion, which involves getting the data from the source systems into the data lake and making sure it is a process that is repeatable, and that if anything fails in the daily ingest cycle, there will be operational functions that take care of it.
For example, a platform implementing managed ingestion can raise notifications and capture logs, so that you can debug why an ingestion failed, fix it, and restart the process. This is all tied in with post-processing once the data is stored in the data lake.
Additionally, as we see more and more workloads going toward a streaming scenario, whatever data management functions you applied to batch ingestion (when data was coming in periodically) now need to be applied to data that is streaming in continuously. Integrated data lake management platforms should be able to detect if certain streams are not being ingested based on the SLAs you set.
A data lake management platform should ensure that the capabilities available in the batch ingestion layer are also available in the streaming ingestion layer. Metadata still needs to be captured and data quality checks need to be performed for streaming data. And you still need to validate that the record format is correct, and that
the record values are correct by doing range checks or reference
integrity checks.
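A minimal sketch of such record-level checks, assuming JSON records arriving on a stream and a hypothetical reference list of valid member IDs, might look like this:

    # Hypothetical record-level validation for streaming ingestion.
    import json

    VALID_MEMBER_IDS = {"M100", "M101", "M102"}  # stand-in for a reference data set

    def validate_record(raw: str) -> bool:
        """Return True if the streamed record passes format, range, and
        referential-integrity checks."""
        try:
            record = json.loads(raw)          # format check
        except json.JSONDecodeError:
            return False
        age = record.get("age")
        if not isinstance(age, int) or not 0 <= age <= 120:   # range check
            return False
        return record.get("member_id") in VALID_MEMBER_IDS    # referential check

    # Example usage against a couple of incoming records.
    print(validate_record('{"member_id": "M100", "age": 42}'))   # True
    print(validate_record('{"member_id": "X999", "age": 42}'))   # False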
By using a data management solution purpose-built to provide these capabilities, you build the foundation for a well-defined data pipeline. Of course, you need the right processes, too, such as assigning stewards for new data sets that get ingested.
Data Organization
When you store data, depending on the use case, you may have some security and encryption requirements to consider. Data may need to be either masked or tokenized, and protected with proper access controls.
A core attribute of the data lake architecture is that multiple groups share access to centrally stored data. While this is very efficient, you have to make sure that all users have appropriate permission levels to view this information. For example, in a healthcare organization, certain information is deemed private by law, such as PHI (Protected Health Information), and violators (organizations that don't protect this PHI) are severely penalized.
The data preparation stage is often where sensitive data, such as financial and health information, is protected. An integrated management platform can perform masking (where data from a field is completely removed) and tokenization (changing parts of the data to something innocuous). This type of platform ensures you have a policy-based mechanism, like access control lists, that you can enforce to make sure the data is protected appropriately.
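Using the definitions above (masking removes a field's value, tokenization replaces part of it with something innocuous), a hypothetical implementation might look like the following; the field names and token scheme are assumptions, not a platform's actual behavior.

    # Hypothetical masking and tokenization of sensitive fields.
    import hashlib

    def mask(record: dict, field: str) -> dict:
        """Masking: remove the field's value entirely."""
        cleaned = dict(record)
        cleaned[field] = None
        return cleaned

    def tokenize(record: dict, field: str) -> dict:
        """Tokenization: replace most of the value with an innocuous token,
        keeping the last four characters for reference."""
        cleaned = dict(record)
        value = str(cleaned[field])
        token = hashlib.sha256(value.encode()).hexdigest()[:8]
        cleaned[field] = f"tok_{token}_{value[-4:]}"
        return cleaned

    patient = {"name": "Jane Doe", "ssn": "123-45-6789", "diagnosis": "A10"}
    print(mask(patient, "diagnosis"))
    print(tokenize(patient, "ssn"))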
It's also important to consider the best format for storing the data. You may need to store it in the raw format in which it came, but you may also want to store it in a format that is more consumable for business users, so that queries will run faster. For example, queries run on columnar data sets will return much faster results than those run on a typical row data set. You may also want to compress the data, as it may be arriving in large volumes, to save on storage.
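A short, hypothetical PySpark example of converting raw row-oriented data to a compressed columnar format (the paths are placeholders):

    # Hypothetical conversion of raw data to compressed, columnar Parquet.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("columnar-conversion").getOrCreate()

    raw = spark.read.csv("/lake/raw/transactions/", header=True, inferSchema=True)

    # Columnar Parquet with Snappy compression: faster analytical queries
    # and lower storage cost than the raw row-oriented files.
    (
        raw.write
           .option("compression", "snappy")
           .mode("overwrite")
           .parquet("/lake/refined/transactions/")
    )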
Also, when storing data, the platform should ideally enable you to automate data lifecycle management functions. For example, you may store the data in different zones in the data lake, depending on different SLAs. For instance, as raw data comes in, you may want to store it in a hot zone where frequently used data is stored.
Data Catalog
With the distributed HDFS filesystem, you ingest information that is first broken up into blocks, and then written in a distributed manner in the cluster. However, sometimes you need to see what data sets exist in the data lake, the properties of those data sets, the ingestion history of the data set, the data quality, and the key performance indicators (KPIs) of the data as it was ingested. You should also see the data profile, and all the metadata attributes, including those that are business, technical, and operational. All of these things need to be abstracted to a level where the user can understand them, and use that data effectively; this is where the data lake catalog comes in.
Your management platform should make it easy to create a data catalog, and to provide that catalog to business users, so they can easily search it, whether searching by source system, schema attributes, subject area, or time range. This is essential if your business users are to get the most out of the data lake, and use it in a swift and agile way.
With a data catalog, users can find data sets that are curated, so that
they dont spend time cleaning up and preparing the data. This has
already been done for them, particularly in cases of data that has
made it to the trusted area. Users are thus able to select the data sets
they want for model building without involving IT, which shortens
the analytics timeline.
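The exact catalog interface depends on the platform; as a rough, hypothetical illustration, a catalog entry and a simple attribute-based search might look like this:

    # Hypothetical, simplified catalog entries and an attribute-based search.
    catalog = [
        {
            "name": "member_claims",
            "source_system": "claims_db",
            "subject_area": "healthcare",
            "zone": "trusted",
            "fields": ["member_id", "claim_date", "amount", "discount"],
        },
        {
            "name": "web_clickstream",
            "source_system": "web_logs",
            "subject_area": "marketing",
            "zone": "raw",
            "fields": ["user_id", "event_time", "page"],
        },
    ]

    def find_datasets_with_field(field: str) -> list:
        """Return data sets whose schema contains the given field name."""
        return [entry["name"] for entry in catalog if field in entry["fields"]]

    # "Show me all of the data sets that have a field called discount."
    print(find_datasets_with_field("discount"))   # ['member_claims']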
Capturing Metadata
Metadata is extraordinarily important to managing your data lake. An integrated data lake management platform makes metadata creation and maintenance an integral part of the data lake processes. This is essential, as without effective metadata, data dumped into a data lake may never be seen again.
You may have a lot of requirements that are defined by your organization's central data authority, by your chief data officer or data
stewards in your lines of business, who may want to specify the various attributes and entities of data that they are bringing into the data lake.
Metadata is critical for making sure data is leveraged to its fullest.
Whether manually collected or automatically created during data
ingestion, metadata allows your users to locate the data they want to
analyze. It also provides clues for future users to understand the
contents of a data set and how it could be reused.
As data lakes grow deeper and more important to the conduct of
daily business, metadata is a vital tool in ensuring that the data we
pour into these lakes can be found and harnessed for years to come.
There are three distinct but equally important types of metadata to
collect: technical, operational, and business data, as shown in
Table 4-1.
Technical metadata captures the form and structure of each data set. For example, it captures the type of data file (text, JSON, Avro) and the structure of the data (the fields and their types), and other technical attributes. This is either automatically associated with a file upon ingestion or discovered manually after ingestion. Operational metadata captures the lineage, quality, profile, and provenance of the data at both the file and the record levels, as well as the number of records. Someone must manually enter and tag entities with operational metadata. Business metadata captures what the user needs to know about the data, such as the business names, the descriptions of the data, the tags, the quality, and the masking rules for privacy. All of this can be automatically captured by an integrated data management platform upon ingestion.
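As a sketch only (the attributes an actual platform captures will differ), the three kinds of metadata described above could be represented for a single hypothetical data set like this:

    # Hypothetical metadata record combining technical, operational, and
    # business metadata for one data set.
    dataset_metadata = {
        "technical": {
            "file_type": "Avro",
            "schema": {"member_id": "string", "claim_date": "date", "amount": "double"},
        },
        "operational": {
            "source_file": "claims_2016_01_01.avro",
            "record_count": 1250000,
            "lineage": ["claims_db.claims", "/lake/raw/claims/"],
            "quality_score": 0.98,
        },
        "business": {
            "business_name": "Member claims",
            "description": "Adjudicated member claims, refreshed daily",
            "tags": ["claims", "healthcare"],
            "masking_rules": {"ssn": "mask"},
        },
    }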
All of these types of metadata should be created and actively curated; otherwise, the data lake is simply a wasted opportunity.
Data Preparation
Making it easier for business users to access and use the data that resides in the Hadoop data lake, without depending on IT assistance, is critical to achieving the business goals the data lake was created to address in the first place.
However, just adding raw data to the data lake does not make that data ready for use by data and analytics applications: data preparation is required. Inevitably, data will come into the data lake with a certain amount of errors, corrupted formats, or duplicates. A data management platform makes it easier to adequately prepare and clean the data using built-in functionality that delivers data security, quality, and visibility. Through workflow orchestration, rules are automatically applied to new data as it flows into the lake.
For example, Bedrock allows you to automatically orchestrate and manage the data preparation process from simple to complex, so that when your users are ready to analyze the data, the data is available.
Data preparation capabilities of an integrated data lake management
platform should include:
Additionally, when going from raw to refined, you might want to watermark the data by assigning a unique ID for each record of the data, so you can trace a record back to its original file. You can watermark at either the record or file level. Similarly, you may need to do format conversions as part of your data preparation, for example, if you prefer to store the data in a columnar format.
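A minimal PySpark sketch of record-level watermarking, with hypothetical paths, might assign each record a unique ID and remember which raw file it came from:

    # Hypothetical record-level watermarking during raw-to-refined processing.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("watermarking").getOrCreate()

    raw = spark.read.json("/lake/raw/claims/")

    watermarked = (
        raw.withColumn("record_id", F.monotonically_increasing_id())  # unique ID per record
           .withColumn("source_file", F.input_file_name())            # trace back to the raw file
    )

    watermarked.write.mode("overwrite").parquet("/lake/refined/claims/")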
Other issues can arise. You may have changes in data coming from
source systems. How do you reconcile that changed data with the
original data sets you brought in? You should be able to maintain a
time series of what happens over a period of time.
A data management platform can do all of this, and ensure that all
necessary data preparation is completed before the data is published
into the data lake for consumption.
Data Provisioning
Self-service consumption is essential for a successful data lake. Different types of users consume the data, and they are looking for different things, but each wants to access the data in a self-service manner, without the help of IT:
The Executive
An executive is usually a person in senior management looking for high-level analyses that can help her make important business decisions. For example, an executive could be looking for predictive analytics of product sales based on history and analytical models built by data scientists. In an integrated data lake management platform, data would be ingested from various sources, some streaming and some batch, and then processed in batches to come up with insights, with the final data able to be visualized using Tableau or Excel.
Another common example is an executive who needs a 360-degree view of a customer, including metrics from every level of the organization (pre-sales, sales, and customer support) in a single report.
A Downstream System
A fourth type of consumer is a downstream system, such as an application or a platform, which receives the raw or refined data. Leading companies are building new applications and products on top of their data lake, so they are also consumers of the data. They may also use RESTful APIs or some other API mechanisms on an ongoing basis. For example, if the downstream application is a database, the data lake can ingest and transform the data, and send the final aggregated data to the downstream system for storage.
Self-Service
For our purposes, self-service means that non-technical business
users can access and analyze data without involving IT.
In a self-service model, users should be able to see the metadata and
profiles and understand what the attributes of each data set mean.
The metadata must provide enough information for users to create
new data formats out of existing data formats, using enrichments
and analytics.
Also, in a self-service model, the catalog will be the foundation for users to register all of the different data sets in the data lake. This means that users can go to the data lake and search to find the data sets they need. They should also be able to search on any kind of attribute; for example, on a time window such as January 1st to February 1st, or based on a subject area, such as marketing versus finance. Users should also be able to find data sets based on attributes; for example, they could enter, "Show me all of the data sets that have a field called discount or percentage."
It is in the self-service capability that best practices for the various types of metadata come into play. Business users are interested in the business metadata, such as the source systems, the frequency with which the data comes in, and the descriptions of the data sets or attributes. Users are also interested in knowing the technical metadata: the structure and format and schema of the data.
When it comes to operational data, users want to see information about lineage, including when the data was ingested into the data lake, and whether it was raw at the time of ingestion. If the data was not raw when ingested, users should be able to see how it was created, and what other data sets were used to create it. Also important to operational data is the quality of the data. Users should be able to define certain rules about data quality, and use them to perform checks on the data sets.
Users may also want to see the ingestion history. If a user is looking
at streaming data, for example, they might search for days where no
data came in, as a way of ensuring that those days are not included
in the representative data sets for campaign analytics. Overall, access
to lineage information, the ability to perform quality checks, and
ingestion history give business users a good sense of the data, so
they can quickly begin analytics.
Healthcare
Many large healthcare providers maintain millions of records for millions of patients, including semi-structured reports such as radiology images, unstructured doctors' notes, and data captured in spreadsheets and other common computer applications. A data lake is an obvious solution for such organizations, because it solves the challenge healthcare providers face with data storage, integration, and accessibility. By implementing a data lake based on a Hadoop architecture, a healthcare provider can enable distributed big data processing using broadly accepted open software standards and massively parallel commodity hardware.
Financial Services
In the financial services industry, data lakes can be used to comply with the Dodd-Frank regulation. By consolidating multiple EDWs into one data lake repository, financial institutions can move reconciliation, settlement, and Dodd-Frank reporting to a single platform. This dramatically reduces the heavy lifting of integration, as data is stored in a standard yet flexible format that can accommodate unstructured data.
Retail banking also has important use cases for data lakes. In retail banking, large institutions need to process thousands of applications for new checking and savings accounts on a daily basis. Bankers who accept these applications consult third-party risk scoring services before opening an account, yet it is common for bank risk analysts to manually override negative recommendations for applicants with poor banking histories. Although these overrides can happen for good reasons (say there are extenuating circumstances for a particular person's application), high-risk accounts tend to be overdrawn and cost banks millions of dollars in losses due to mismanagement or fraud.
By moving to a Hadoop data lake, banks can store and analyze multiple data streams, and help regional managers control account risk in distributed branches. They are able to find out which risk analysts were making account decisions that went against risk information provided by third parties. The net result is better control of fraud. Over time, the accumulation of data in the data lake allows the bank to build algorithms that detect subtle but high-risk patterns that bank risk analysts may have previously failed to identify.
own private clouds. It can be used to store sensitive or vulnerable data that organizations can't trust to a public cloud environment. At the same time, other, less-sensitive data sets can be moved to the public cloud.
Federated Queries
Federated queries go hand in hand with logical data lakes. As data is stored in different physical and virtual environments, you may need to use different query tools, and decompose a user's query into multiple queries, sending them to on-premises data stores as well as cloud-based data stores, each of which possesses just part of the answer. Federated queries allow answers to be aggregated and combined, and sent back to the user so she gets one version of the truth across the entire logical data lake.
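In very rough terms (the stores, queries, and helper functions below are placeholders, not a real federation engine), a federated query decomposes one user query, runs the pieces against each store, and combines the partial results:

    # Hypothetical sketch of a federated query across two stores.
    def query_on_premises(sql: str) -> list:
        # Placeholder for a query against an on-premises Hive/HDFS store.
        return [{"region": "east", "revenue": 120}]

    def query_cloud(sql: str) -> list:
        # Placeholder for a query against a cloud object-store-backed engine.
        return [{"region": "west", "revenue": 95}]

    def federated_query(sql: str) -> list:
        """Send the same logical query to each participating store and
        aggregate the partial answers into one result for the user."""
        partials = query_on_premises(sql) + query_cloud(sql)
        combined = {}
        for row in partials:
            combined[row["region"]] = combined.get(row["region"], 0) + row["revenue"]
        return [{"region": r, "revenue": v} for r, v in combined.items()]

    print(federated_query("SELECT region, SUM(revenue) FROM sales GROUP BY region"))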
revenue streams that you can enable business teams to deliver. Being able to define and articulate this value from a business standpoint, and convince partners to join you on the journey, is very important to your success.
Architectural oversight
Once you have the business alignment and you know what your
priorities are, you need to define the upfront architecture: what
are the different components you will need, and what will the
end technical platform look like? Keep in mind that this is a
long-term investment, so you need to think carefully about
where the technology is moving. Naturally, you may not have all
the answers upfront, so it might be necessary to perform a proof
of concept to get some experience and to tune and learn along
the way. An especially important aspect of your architectural
plans is a good data-management strategy that includes data
governance and metadata, and how you will capture that. This
is critical if you want to build a managed and governed data lake
instead of the much-maligned data swamp.
Security strategy
Outline a robust security strategy, especially if your data lake will be a shared platform used by multiple lines of business or by both internal and external stakeholders. Data privacy and security are critical, especially for sensitive data such as PHI and PII. You may even have regulatory rules you need to conform to. You must also think about multi-tenancy: certain users might not be able to share data with other users. If you are serving multiple external audiences, each customer might have individual data agreements with you, and you need to honor them.
I/O and memory model
As part of your technology platform and architecture, you must think about what the scale-out capabilities of your data lake will look like. For example, are you going to use decoupling between the storage and the compute layers? If that's the case, what is the persistent storage layer? Already, enterprises are using Azure or S3 in the cloud to store data persistently, but then spinning up clusters dynamically and spinning them down again when processing is finished. If you plan to perform actions like these, you need to thoroughly understand the throughput requirements from a data ingestion standpoint, which will dictate throughput