Data Lake or Data Swamp?
John de Koning, success advisor in industrial data processing, has his roots in the oil and gas industry. As a technology and innovation manager for Shell, John focused on generating $500+ million in value annually by introducing innovative ways of processing manufacturing and production data. He became an industry leader by introducing architectures to contextualize, integrate, and aggregate manufacturing and production data at a corporate level. The experience and understanding gained there form the foundation of this white paper.
Data lakes are a simple way for businesses to collect and store raw data from a variety of inputs, without having to know in advance exactly how the data will be used. But in order for data to drive business outcomes, it must be organized and accessible. Without structure, the data lake becomes a swamp. A variety of advanced real-time software systems are available that integrate with enterprise data lake software and can help collect and structure data so it can be used effectively.
Various solutions are available for processing time-series data. Some of them even claim to be the Holy Grail of data management, advocating streaming sensor and machine data straight to a data lake and the cloud and organizing it later. But what about the nature of industrial data streams and the legacy automation equipment that is already out there? In industrial environments especially, automation systems have a life cycle of up to 20 years, and replacement is a serious investment. Sending the raw data from these sources to the data lake is not even an option, as interfaces for these legacy sources do not exist.

Access to the data should be simple and affordable, but still enable enterprise-wide reporting and analyses.

The solution architecture for time-series data should follow a few strict rules:

1. Connectivity
Ensure the corporate solution is able to connect to the variety of (legacy) data sources and potential new sources.

2. Time-Series Capacity
The system should be able to deal with time-series data (high fidelity, time indexing, and time synchronization).

3. Context
Systems should have easy-to-understand asset/equipment-based relations between the individual data streams, to enable business users to easily compare, view, and analyze data at an equipment level without being an IT specialist or data scientist.

4. Accessibility
Process users should be able to analyze and visualize the data to help optimize the use of production facilities.

5. Security
Keep your production facility safe and secure! Don't allow unintended back-door access to your automation system.
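Rule 2 deserves a concrete illustration. Time indexing and time synchronization mean that irregular streams from different sensors must be re-indexed onto a shared time base before they can be compared. The sketch below uses plain Python with linear interpolation; the sensor names, timestamps, and values are invented for illustration:

```python
from bisect import bisect_left

def align(series, t):
    """Linearly interpolate an irregular (time, value) series at time t."""
    times = [point[0] for point in series]
    i = bisect_left(times, t)
    if i == 0:
        return series[0][1]          # before first sample: hold first value
    if i == len(series):
        return series[-1][1]         # after last sample: hold last value
    (t0, v0), (t1, v1) = series[i - 1], series[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

# Two hypothetical sensor streams with unsynchronized timestamps (seconds).
pressure = [(0.0, 10.0), (2.0, 12.0), (4.0, 11.0)]
flow     = [(0.5, 100.0), (2.5, 104.0), (4.5, 102.0)]

# Re-index both streams onto a shared one-second time base so that rows
# can be compared sample-for-sample.
grid = [0.0, 1.0, 2.0, 3.0, 4.0]
synced = [(t, align(pressure, t), align(flow, t)) for t in grid]
```

A real time-series infrastructure also handles clock drift, quality flags, and compression, but the re-indexing step above is the core of what "time synchronization" asks of the system.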
Various solutions are available that combine real-time infrastructure and data lake technologies. Based on the above rules, the technology combination of the OSIsoft PI System toolkit with supporting Data Context Automation tools, like those delivered by Element Analytics, is a leading strategy for a solution architecture that supports both the time-series data needs within operations and enterprise data lake initiatives. Dedicated integration tools are available to easily integrate with Enterprise Data Warehouses and data lake technologies from Microsoft, SAP, or Hadoop, in the cloud or on premises.

Large companies, like global energy enterprises, have proven this technology combination can drive $500+ million in benefits per year by introducing enterprise tools and processes for Proactive Monitoring, Exception Based Surveillance, Rotating Equipment Monitoring, Condition Based Maintenance, margin visualization, etc. All of these result in better uptime and higher efficiency of the facilities.
Traditional data warehouse technologies use predefined data models to describe the database. The advantage is that you know upfront what the data structure looks like. The downside is that data warehouses are inflexible. A traditional data warehouse cannot keep up with rapid changes in the data model caused by the proliferation of new data sources and the new questions people want to ask of the data. This overwhelming rate of change undermines the traditional way of working, which starts by building a data model and a database schema. In addition, traditional (data) change management no longer works, as version control is hard to maintain with a fast-changing data model.
In a data lake environment, raw data is pushed to the store in its original state. This can be structured data, unstructured data, blobs, etc. Instead of predefining how the data elements are related to each other (a data model), as with a data warehouse, you create the relationships once you need to retrieve the data from the data lake. This is also the major downside of a data lake. With databases and warehouses, it was possible for business professionals (non-IT) to query the data, as the complex data modeling was already done upfront by IT specialists. In the case of a data lake, you need to be a data scientist to be able to analyze the various chunks of data and link them together to make sense of them. Table 1 summarizes key characteristics of data warehouses versus data lakes.

Table 1 (excerpt): Storage. Data warehouse: expensive for large data volumes. Data lake: designed for low-cost storage.
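The schema-on-write versus schema-on-read contrast described above can be sketched in a few lines. In the warehouse style, a record must match the predefined model before it is stored; in the lake style, raw blobs are stored untouched and interpreted only at query time. The record layouts and tag names below are hypothetical:

```python
import json

# Schema-on-write (warehouse style): records are validated against a
# predefined model before they are stored; anything off-model is rejected.
WAREHOUSE_SCHEMA = {"tag", "timestamp", "value"}

def load_into_warehouse(record, table):
    if set(record) != WAREHOUSE_SCHEMA:
        raise ValueError("record does not match the predefined data model")
    table.append(record)

# Schema-on-read (lake style): raw payloads go in untouched; structure is
# imposed only when someone queries, which is why a data scientist is
# usually needed to make sense of the chunks.
def query_lake(raw_blobs, tag):
    values = []
    for blob in raw_blobs:
        record = json.loads(blob)        # interpretation happens at read time
        if record.get("tag") == tag:     # relationship built on retrieval
            values.append(record.get("value"))
    return values

# Heterogeneous raw payloads, as they would land in a lake.
lake = ['{"tag": "PUMP1.PRESSURE", "value": 12.3}',
        '{"tag": "PUMP1.FLOW", "value": 104.0, "unit": "m3/h"}']
```

Note how the second lake record carries an extra `unit` field: the lake accepts it without complaint, while the warehouse loader would reject it until the data model is updated.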
The perfect world is very simple. You want to have access to all the data that is available (internal and external), query the data in any combination, run integrated analytics to find the missing pieces, and visualize the information you are looking for with the tool of your preference. However, the reality is often different. When combined with a real-time, time-series environment, the core concerns relate to the diversity of (legacy) data sources, network latency and reliability, data latency, time synchronization of data streams, and the context or relationships between data streams.
A hybrid model delivered by an ecosystem of suppliers will help to bridge the gap between the Perfect World and the technology constraints.
Depending on the company size and the equipment used for production, the variety in time-series data sources can be significant. There will be a legacy of control and automation systems, especially in older companies with various production locations, which may include systems from various brands, various types per brand, and various versions per type. Sending the raw data from these sources to the data lake is not even an option, as interfaces for these legacy sources do not exist. Also, the facility location can introduce significant data reliability concerns. Remote facilities connected via low-bandwidth connections, like satellites, need additional functionality to avoid data loss. Another important aspect is security. To assure the integrity and safe operation of your facility, the interface technology must be very secure. The following table shows the benefits of adding advanced real-time, time-series systems to the hybrid model to address the key concerns of the data lake technology.
The combination of data lake technology and time-series infrastructure will help to address the core concerns of the Perfect World. In this situation, the time-series data infrastructure will collect all the data from the field. The time-series data infrastructure will also assure the availability of data in the field for local viewing, processing, and reporting (edge computing), or for feeding data to (near) real-time optimization or advanced control. This edge computing assures the data and system availability needed to run and monitor the equipment in the production process itself by avoiding network availability and data latency issues.
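One way to picture the edge role described above is a store-and-forward buffer: samples are kept locally at the facility and drained to the central store only when the (possibly low-bandwidth) link is up, so an outage never loses data. This is a minimal sketch in plain Python, not a real product API; all names are invented:

```python
from collections import deque

class EdgeBuffer:
    """Store-and-forward buffer for an edge node: samples survive network
    outages locally and are forwarded to the central store when possible."""

    def __init__(self):
        self.pending = deque()

    def collect(self, sample):
        # The local copy is always written first, so local viewing and
        # monitoring keep working even when the link to the lake is down.
        self.pending.append(sample)

    def forward(self, link_up, send):
        """Drain buffered samples through `send` while the link is up.
        Returns the number of samples forwarded."""
        sent = 0
        while link_up and self.pending:
            send(self.pending.popleft())
            sent += 1
        return sent
```

A production implementation would add persistence to disk, bounded buffer sizes, and retry logic, but the ordering guarantee (collect locally first, forward opportunistically) is the essence of avoiding the data loss described for remote facilities.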
In order for data to drive business outcomes, it must be organized and accessible. Without structure, the data lake becomes a swamp. Individual data points have value for engineers very close to the production facility. Engineers usually know in detail how the facility is built and how to find each data point. However, as soon as reporting, monitoring, or analysis happens outside of the local environment, it becomes important to add structure, governance, and context to the huge number of available data points. Knowing each data point individually by name is no longer an option.
Example: Consider the contextual data that surrounds a single lube oil pump in a large facility. Each pump will have a parameter for pump name, power consumption, outlet pressure, outlet flow, outlet temperature, and filter differential pressure. Furthermore, anyone in the organization should know where the pump resides, where it sits in a process, and what may flow through the pump. Given the diversity of pumps and their various applications and processes, simply comparing all pumps is meaningless for analytics without this context.
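The pump example can be made concrete as an asset template: a fixed list of parameters every lube oil pump exposes, plus a per-pump mapping from those parameters to site-specific raw tag names. The sketch below is a simplified illustration in plain Python (real asset models, such as PI Asset Framework, are far richer), and every name and tag in it is invented:

```python
from dataclasses import dataclass, field

# The template names the parameters every lube oil pump exposes, so pumps
# can be compared at the equipment level instead of by raw tag names.
PUMP_TEMPLATE = ["power_consumption", "outlet_pressure", "outlet_flow",
                 "outlet_temp", "filter_diff_pressure"]

@dataclass
class PumpAsset:
    name: str
    location: str                                 # where in the plant/process it resides
    tag_map: dict = field(default_factory=dict)   # template parameter -> raw tag name

    def missing_parameters(self):
        """Template parameters not yet mapped to a data stream."""
        return [p for p in PUMP_TEMPLATE if p not in self.tag_map]

# A partially mapped pump: governance tooling can flag what is still missing.
p101 = PumpAsset("P-101", "Unit 3 / lube oil skid",
                 {"outlet_pressure": "U3.P101.PI001",
                  "outlet_flow": "U3.P101.FI002"})
```

Because every pump instance fills in the same template slots, a fleet-wide query like "outlet pressure of all lube oil pumps" no longer requires knowing thousands of individual tag names.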
In the case of OSIsoft's PI System, a toolkit is available to automate and significantly reduce the effort needed to build templates and map data streams into a structure. This toolkit is delivered by Element Analytics and reduces the time to value by 80%.
Data scientists need to consider not only the context of data, but also the data preparation. This is where most of the effort can be spent. Data scientists need to prepare the data by selecting the data set, cleansing the data, aligning data in time, and formatting it in the correct layout. This is the greatest challenge to data scientists looking to utilize time-series data for advanced analytics. Agile self-service data preparation tools like OSIsoft's Business Integrators, in combination with tools like those from Element Analytics, help to open up big data analytics for business users who are not IT specialists or data scientists. Companies like Cemex have shown that a traditional time-series data preparation that would take six months before it is ready for analytics can be reduced to four minutes of preparation time with the right tools. This agile and user-friendly way of working with the OSIsoft toolkit will reduce the time to value for business ideas significantly. In addition, less involvement is needed from IT specialists and data scientists, and it significantly reduces the Total Cost of Ownership (TCO) for the same business value.
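The four preparation steps named above (select, cleanse, align in time, format) can be sketched in plain Python. This is an illustration of the steps themselves, not the OSIsoft or Element Analytics tooling, and the tags, thresholds, and record layout are invented:

```python
def prepare(raw, tag, lo, hi):
    """Minimal version of the four preparation steps for one data stream."""
    # 1. Select: keep only the records for the tag of interest.
    selected = [r for r in raw if r["tag"] == tag]
    # 2. Cleanse: drop missing and physically implausible values.
    clean = [r for r in selected
             if r["value"] is not None and lo <= r["value"] <= hi]
    # 3. Align in time: order the samples by timestamp.
    clean.sort(key=lambda r: r["t"])
    # 4. Format: emit (timestamp, value) rows in an analytics-ready layout.
    return [(r["t"], r["value"]) for r in clean]

# Hypothetical raw records as they might arrive, unordered and noisy.
raw = [
    {"tag": "PUMP1.PRESSURE", "t": 2.0, "value": 12.0},
    {"tag": "PUMP1.FLOW",     "t": 1.0, "value": 100.0},
    {"tag": "PUMP1.PRESSURE", "t": 1.0, "value": 10.0},
    {"tag": "PUMP1.PRESSURE", "t": 3.0, "value": None},    # missing reading
    {"tag": "PUMP1.PRESSURE", "t": 4.0, "value": 999.0},   # sensor spike
]

rows = prepare(raw, "PUMP1.PRESSURE", lo=0.0, hi=50.0)
```

Done by hand across thousands of tags, these four steps are exactly where the months of effort go; self-service tools automate them per template rather than per tag.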
The perfect world for industrial data processing does not exist yet. Pushing all production and operational data in a raw format to a central big data store will result in a data swamp instead of a data lake. Only specialized data scientists will be able to make sense out of the data. In an industrial environment, pre-processing of all real-time data is essential. Bringing context to the data is a must to assure that business users can leverage the data to optimize operations.
This means that in an industrial environment the combination of a data lake with a real-time infrastructure will bring all the benefits of Big Data processing, like:

- Connectivity to the very diverse production and automation world
- Enterprise application development and reporting is enabled by having a One-Stop-Shop for data with a standardized data model for all assets
- Operational staff will have direct access to real-time data in a structured and agile way to optimize day-to-day operations
- Data Scientists will be able to find the big value items by combining all the data
Figure 6: An Enterprise Operations Infrastructure provides the foundation to ensure analytics-ready data for data initiatives.