Data Fabric As Modern Data Architecture
Alice LaPlante
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Data Fabric as
Modern Data Architecture, the cover image, and related trade dress are trademarks
of O’Reilly Media, Inc.
The views expressed in this work are those of the author, and do not represent the
publisher’s views. While the publisher and the author have used good faith efforts to
ensure that the information and instructions contained in this work are accurate, the
publisher and the author disclaim all responsibility for errors or omissions,
including without limitation responsibility for damages resulting from the use of or
reliance on this work. Use of the information and instructions contained in this work is
at your own risk. If any code samples or other technology this work contains or
describes is subject to open source licenses or the intellectual property rights of
others, it is your responsibility to ensure that your use thereof complies with such
licenses and/or rights.
This work is part of a collaboration between O’Reilly and TIBCO. See our statement
of editorial independence.
978-1-098-10594-5
[LSI]
Introduction
1 “Data: The Strategic Asset,” Financial Executives Research Foundation, Inc., November
2019, https://oreil.ly/uEflr.
Of the five Vs, variety refers to the different types of data that can
exist. When data variety is high, the complexity of the data increases,
which is the chief reason businesses are seeking data fabrics:
they have X sources of data, and every source has hundreds of
tables, each with dozens of columns. At the same time, with all these
sources of data they must serve Y users or use cases, each requiring
slightly different data.
Whether data is structured or unstructured is only the beginning of
the complexity facing businesses today. Most are familiar with these
two categories (three if you add semistructured data) and have figured
out ways to integrate them. But there are a number of other
challenges specifically concerning the variety of data. Chief among
them is performing analytics with mixed-modal data, since traditional
analytics is designed to work with highly formatted data and copes
poorly with inconsistent or noisy data. This makes it hard to integrate
different types of data, which is why data lakes are notoriously
difficult to manage. Finally, data that exhibits a lot of variety can
be of low quality.
A subset of data variety is data distribution. That is, we would argue
that it’s not only the different types of data but also the number of
sources that raises challenges, especially when considering how much
data is being created and stored in the cloud.
Essentially, data is everywhere, and it is all different. This includes
Internet of Things (IoT) data from a distribution warehouse, real-
time SAP transactions, and Salesforce or other software-as-a-service
(SaaS) datasets. All of these sources may involve customer data of
some kind, but each has a different purpose and different data
consumers.
All the silos in all the departments, each with its own set of tools and
techniques, business rules, and definitions that must be orchestrated,
also add to the complexity.
Questions arise. Where is the data? What kind of data is it? How can
I get the data to the users who need it?
Centralization implies control, and some companies are still pursuing
the goal of having only one, centralized source of data (we’ll
explain why this is not necessarily such a good idea later in this
report). Unsurprisingly (to us), only 6% of companies have achieved
this, according to a recent survey from the Business Application
Research Center.2
On the other hand, most companies use multiple data sources (see
Figure I-1). Almost one in five companies (18%) use 20 or more
data sources for decision-making, with this number expected to
grow to 50% in the near future, according to the BARC survey. But
the more data sources you have, the more likely it is that data quality
will be a problem. Data governance thus becomes even more critical
as dependence on multiple data sources increases.
Figure I-1. Most companies use multiple data sources for decision-making
(Source: BARC)
• 66% report less operational efficiency as a result of broken
pipelines.
• 59% report delayed decisions or lost opportunities because of
broken pipelines.
4 “Big Data and AI Executive Survey 2019: Executive Summary of Findings,” NewVantage
Partners LLC, accessed May 6, 2021, https://oreil.ly/AnyH8.
In NewVantage’s 2020 report, however, the news was not particularly
good. Although investment in data was up, showing that companies
generally realize data’s importance, the pace of that investment was
losing momentum. The percentage of companies investing more
than $50 million in data was 65% in 2020, compared to just 40% in
2019. But only 52% of companies were increasing their rate of
investment, compared to the 92% that were doing this in 2019.5
Worse, only 38% reported that they had created a data-driven organization.
Even fewer, only 27%, had built a data culture. This tells
us that the all-important goal of data democratization is not being
reached. And it’s not necessarily the technology that is holding firms
back. Nine out of 10 companies point to people and process challenges
as the biggest barriers to data democratization.
5 “NewVantage Partners Releases 2020 Big Data and AI Executive Survey,” Business Wire,
Jan. 6, 2020, https://oreil.ly/SPTy5.
their time finding insights.6 The new 70/30 rule is a substantial
improvement on the 80/20 one.
This is becoming possible because businesses are organizing and
managing their data in smarter ways, in particular by using something
called a data fabric.
In this report, we’ll first describe the conditions that are pushing the
limits of current data management strategies. Then we’ll explain
what a data fabric is, including its components and its architecture.
We’ll highlight the benefits and some early use cases. Finally, we’ll
provide five pieces of advice for getting started on deploying a data
fabric in your organization, along with some best practices for making
sure you’re doing it right.
6 Stewart Bond, “End-User Survey Results: Deployment and Data Intelligence in 2019,”
IDC, November 2019, https://oreil.ly/EihMu.
CHAPTER 1
Why Build a Data Fabric?
Why do you need this thing called a data fabric? It’s not just because
of the sheer size of your data. You also are faced with access and
integration challenges because of where the data is coming from,
where it’s stored, and in what form. You’ve got data on premises. In
the public cloud. In private clouds. You have data in multicloud and
hybrid cloud ecosystems. Within these various silos, some of the
data is structured but most is unstructured, which raises challenges.
And don’t forget streaming data; that’s an important part of the picture,
too.
What’s the state of enterprise data, then? Fragmented. A full 93% of
enterprises have a multicloud strategy, with 87% having a hybrid
cloud environment in place, according to Flexera’s 2020 State of the
Cloud survey.1 On average, companies have data stored in 2.2 public
and 2.2 private clouds, as well as in various on-premises data repositories
(see Figure 1-1).
1 Tanner Luxner, “Cloud Computing Trends: 2021 State of the Cloud Report,” Flexera,
March 15, 2021, https://oreil.ly/skemo.
Figure 1-1. The fragmented state of enterprise data (Source: Flexera)
The reasons for this fragmentation are varied, and include the
following:
Time-to-data-insight is a competitive differentiator
Today nearly every business transformation—whether aiming
for greater customer intimacy, more optimized operations, or
faster innovation—is fueled by data-driven insights. The days
when business users would patiently wait weeks or even months
for IT to deliver new datasets are gone. Not only are your users
demanding rapid responses to their queries, but the competitive
nature of today’s markets requires it. The dilemma is that queries
on databases with billions of records can take hours to
return. The need to change this is urgent, as companies with
data intelligence shared in real time or near-real time are 18
times more likely to make better and faster decisions than their
competitors.2
Demand for self-service data continues to explode
Enabled by easier-to-use, more powerful analytics tools such as
Power BI and Spotfire, business users are demanding more data,
delivered more swiftly. Whether you consider this data democ‐
ratization or data chaos, the trend is very real, and data users’
needs must be satisfied for your organization to maintain a
competitive edge.
2 Adam DeMattia, John McKnight, Jennifer Gahm, and Monya Keane, “Research Proves
IT Transformation’s Persistent Link to Agility, Innovation, and Business Value,” The
Enterprise Strategy Group, Inc., March 2018, https://oreil.ly/sAZUW.
Let’s start with what a data fabric isn’t. It is not a single product or
even a single platform. You can’t buy and deploy it overnight. It is an
architecture. And a journey.
The good news is that you don’t have to rip and replace your existing
technology. A data fabric encompasses the data ecosystem you
have in place. Neither do you need to be beholden to a single vendor.
You can choose best-of-breed solutions and, in theory at
least, they should all work together within your data fabric.
To summarize what we discussed in Chapter 1, with a data fabric
your users will get to spend more time analyzing their data than
wrangling with it. And other consumers of data—think systems and
applications—will get access to integrated data. It’s as simple as that.
The data fabric is there to make trusted data easier to find and
accessible to anyone who needs it. This is the frame for our entire
data fabric discussion: that a data fabric will drive the old 80/20 rule
(now 70/30) to increasingly favorable proportions.
Some people call it data intelligence rather than data fabric, because
it makes it easier for users and systems/applications to intelligently
find, work with, and clean data, and apply AI models to it.
So what is a data fabric?
A data fabric is a modern, distributed data architecture that includes
shared data assets and optimized data management and integration
processes that you can use to address today’s data challenges in a
unified way.
Despite what many vendors might claim, a data fabric is not a single
product or specific platform that you can simply buy and insert into
your existing data architecture. It includes architecture, shared data
assets, and data management and integration technology.
A data fabric supports the following:
Data for all users and use cases
Provides timely, trusted, reusable data for a wide range of analytical,
operational, and governance use cases, as well as business
self-service users
Data from any and all sources
Accesses, combines, and transforms both in-motion and at-rest
data from across a diverse, distributed data landscape using
metadata, models, and pipelines
Data that spans any environment
Flexibly spans distributed on-premises, hybrid, and multicloud
environments
In short, a data fabric’s job is to connect any kind of data to anywhere
and anyone (or anything). That’s admittedly a tall order, because IT
systems are getting more complex even as users demand simplicity for
easier, faster decision-making. A data fabric addresses both needs.
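To make the idea of combining in-motion and at-rest data through metadata a little more concrete, here is a minimal sketch in Python. It is an illustration only, not a description of any particular product: the datasets, the `METADATA` dictionary, and the `enrich` function are all hypothetical names invented for this example. A small piece of metadata names the shared key, and a pipeline step uses it to enrich streaming events with stored reference data.

```python
from typing import Dict, Iterable, Iterator, List

Record = Dict[str, object]

# At-rest data: hypothetical customer records, as they might sit in a table.
CUSTOMERS: List[Record] = [
    {"customer_id": "c1", "region": "EMEA"},
    {"customer_id": "c2", "region": "APAC"},
]


def order_events() -> Iterator[Record]:
    """In-motion data: hypothetical order events arriving one at a time."""
    yield {"customer_id": "c2", "amount": 40}
    yield {"customer_id": "c1", "amount": 15}
    yield {"customer_id": "c2", "amount": 5}


# Metadata (assumed shape): records which key links the two datasets.
METADATA = {"orders": {"join_key": "customer_id"}}


def enrich(
    events: Iterable[Record], reference: Iterable[Record], join_key: str
) -> Iterator[Record]:
    """Pipeline step: attach reference attributes to each event as it arrives."""
    lookup = {row[join_key]: row for row in reference}
    for event in events:
        match = lookup.get(event[join_key], {})
        # Event fields win if both sides define the same field.
        yield {**match, **event}


join_key = METADATA["orders"]["join_key"]
enriched = list(enrich(order_events(), CUSTOMERS, join_key))
```

Each consumer receives events that already carry the customer’s region, without needing to know where either dataset physically lives.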
Let’s be very clear that many of the components that make up a data
fabric are not new. They’re constantly evolving, true—especially
when the cloud is involved. But it’s the combination of them that creates
this new thing, this data fabric.
Here are some of the components of a typical data fabric:
Data catalog
Allows you to categorize, access, and collaborate around company
data across multiple data sources, while enforcing strong
governance and access management.
Master data management
Involves creating a single master record for all business data
from across both internal and external data sources.
Metadata management
How you manage the data that describes other data (the metadata).
It involves establishing policies and processes that ensure
1 “Data Fabric Market is Expected to Reach $4.54 Billion by 2026, Says Allied Market
Research,” GlobeNewswire Inc., November 24, 2020, https://oreil.ly/Q16KA.
Data fabric use case: Compliance
Data challenges solved by data fabric:
• Systems are built for operations, not compliance.
• It’s difficult to meet evolving compliance requirements and variations of rules across geographies and LOB operations.
Benefits: One place to go for compliance data:
• Comply faster, with fewer resources
• Stay out of jail, stay out of the media, and avoid fines

Data fabric use case: Risk management
Data challenges solved by data fabric:
• Systemic risk spans organizational and system data silos.
• There is no complete view of risk factors and metrics.
Benefits: One place to go for risk management data:
• Provide a full view of risk
• Help avoid catastrophes
When integrated with Panera Bread’s data fabric, these tools effectively
made the entire company a part of the data team, which made possible
the business agility that Panera so deftly demonstrated.
1 “Panera Cooks with Data to Deliver on Service and Satisfied Customers,” TIBCO Software,
Inc., accessed May 25, 2021, https://oreil.ly/X7wgd.
2 “The State of AI in 2020,” McKinsey & Company, November 17, 2020, https://oreil.ly/qPnJA.
Migrate Intelligently
Data virtualization lets you insulate your consumers as you migrate
your data sources and insulate your sources as you migrate your
consumers. And when you’re finished, you can continue to drive
value from these decoupled data objects.
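The insulation described here can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any vendor’s data virtualization API: `VirtualLayer`, `legacy_orders`, `cloud_orders`, and the logical name `"orders"` are all hypothetical. Consumers ask the layer for a logical dataset, and the physical source behind that name can be swapped without touching consumer code.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]
Source = Callable[[], Iterable[Row]]


class VirtualLayer:
    """Maps logical dataset names to whatever physical source backs them."""

    def __init__(self) -> None:
        self._sources: Dict[str, Source] = {}

    def bind(self, logical_name: str, source: Source) -> None:
        # Rebinding an existing name is how a migration is cut over.
        self._sources[logical_name] = source

    def query(self, logical_name: str) -> List[Row]:
        # Consumers only ever see the logical name.
        return list(self._sources[logical_name]())


# Hypothetical sources: an on-premises table and its cloud replacement.
def legacy_orders() -> Iterable[Row]:
    return [{"order_id": 1, "total": 20.0}, {"order_id": 2, "total": 35.5}]


def cloud_orders() -> Iterable[Row]:
    # Same logical shape, different physical home.
    yield {"order_id": 1, "total": 20.0}
    yield {"order_id": 2, "total": 35.5}


def total_revenue(layer: VirtualLayer) -> float:
    # A consumer that is unaware of where "orders" physically lives.
    return sum(float(row["total"]) for row in layer.query("orders"))


layer = VirtualLayer()
layer.bind("orders", legacy_orders)
before = total_revenue(layer)

layer.bind("orders", cloud_orders)  # migrate the source
after = total_revenue(layer)        # consumer code is unchanged
```

The same indirection works in the other direction: a new consumer can be pointed at the logical name `"orders"` without the underlying sources changing at all.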
Don’t Over-Innovate
Keep privacy and governance in mind. Just because you can do
something with your data doesn’t mean you should. A major retailer
got into trouble several years ago by using its data too cleverly, in a
way that violated the privacy of its consumers.3 Always think of the
3 Charles Duhigg, “How Companies Learn Your Secrets,” New York Times Magazine, Feb.
16, 2012, https://oreil.ly/p7QI7.
4 “Big Data and AI Executive Survey 2021,” NewVantage Partners LLC, January 2021,
https://oreil.ly/hHRKP.