Data Lake or Data Swamp?
John de Koning, success advisor in industrial data processing, has his roots in the oil and gas industry. As a technology and innovation manager for Shell, John focused on generating $500+ million in value annually by introducing innovative ways of processing manufacturing and production data. He became an industry leader by introducing architectures to contextualize, integrate, and aggregate manufacturing and production data at a corporate level. The experience and understanding gained there form the foundation of this white paper.
Data lakes are a simple way for businesses to collect and store raw data from a variety of inputs, without having to know in advance exactly how the data will be used. But in order for data to drive business outcomes, it must be organized and accessible. Without structure, the data lake becomes a swamp. A variety of advanced real-time software systems are available that integrate with enterprise data lake software and can help collect and structure data so it can be used effectively.
Various solutions are available for processing time-series data. Some of them even claim to be the Holy Grail of data management, advocating streaming sensor and machine data straight to a data lake and the cloud and organizing it later. But what about the nature of industrial data streams and the legacy automation equipment that is already out there? In industrial environments especially, automation systems have a life cycle of up to 20 years, and replacement is a serious investment. Sending the raw data from these sources to the data lake is not even an option, as interfaces for these legacy sources do not exist.

Access to the data should be simple and affordable, but still enable enterprise-wide reporting and analyses.

The solution architecture for time-series data should follow a few strict rules:

1. Connectivity
Ensure the corporate solution is able to connect to the variety of (legacy) data sources and potential new sources.

2. Time-Series Capacity
The system should be able to deal with time-series data (high fidelity, time indexing, and time synchronization).

3. Context
Systems should have easy-to-understand asset/equipment-based relations between the individual data streams, to enable business users to easily compare, view, and analyze data at an equipment level without being an IT specialist or data scientist.

4. Accessibility
Process users should be able to analyze and visualize the data to help optimize the use of production facilities.

5. Security
Keep your production facility safe and secure! Don't allow unintended back-door access to your automation system.
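Rule 2 deserves a concrete illustration. Time indexing and time synchronization mean that irregular streams from different sensors must be re-indexed onto a shared time base before they can be compared. The sketch below uses plain Python with linear interpolation; the sensor names, timestamps, and values are invented for illustration:

```python
from bisect import bisect_left

def align(series, t):
    """Linearly interpolate an irregular (time, value) series at time t."""
    times = [point[0] for point in series]
    i = bisect_left(times, t)
    if i == 0:
        return series[0][1]          # before first sample: hold first value
    if i == len(series):
        return series[-1][1]         # after last sample: hold last value
    (t0, v0), (t1, v1) = series[i - 1], series[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

# Two hypothetical sensor streams with unsynchronized timestamps (seconds).
pressure = [(0.0, 10.0), (2.0, 12.0), (4.0, 11.0)]
flow     = [(0.5, 100.0), (2.5, 104.0), (4.5, 102.0)]

# Re-index both streams onto a shared one-second time base so that rows
# can be compared sample-for-sample.
grid = [0.0, 1.0, 2.0, 3.0, 4.0]
synced = [(t, align(pressure, t), align(flow, t)) for t in grid]
```

A real time-series infrastructure also handles clock drift, quality flags, and compression, but the re-indexing step above is the core of what "time synchronization" asks of the system.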
Various solutions are available that combine real-time infrastructure and data lake technologies. Based on the above rules, the technology combination of the OSIsoft PI System toolkit with supporting Data Context Automation tools, like those delivered by Element Analytics, is a leading strategy for a solution architecture that supports both the time-series data needs within operations and enterprise data lake initiatives. Dedicated integration tools are available to easily integrate with Enterprise Data Warehouses and data lake technologies from Microsoft, SAP, or Hadoop, in the cloud or on premises.

Large companies, like global energy enterprises, have proven this technology combination can drive $500+ million in benefits per year by introducing enterprise tools and processes for Proactive Monitoring, Exception Based Surveillance, Rotating Equipment Monitoring, Condition Based Maintenance, margin visualization, etc. All of these result in better uptime and higher efficiency of the facilities.
Traditional data warehouse technologies use predefined data models to describe the database. The advantage is that you know upfront what the data structure looks like. The downside is that data warehouses are inflexible. A traditional data warehouse cannot keep up with rapid changes in the data model caused by the proliferation of new data sources and the new questions people want to ask of the data. This overwhelming rate of change undermines the traditional way of working, which starts by building a data model and a database schema. In addition, traditional (data) change management no longer works, as version control is hard to maintain with a fast-changing data model.
In a data lake environment, raw data is pushed to the store in its original state. This can be structured data, unstructured data, blobs, etc. Instead of predefining how the data elements are related to each other (a data model), as with a data warehouse, you create the relationships once you need to retrieve the data from the data lake. This is also the major downside of a data lake. With databases and warehouses, it was possible for business professionals (non-IT) to query the data, as the complex data modeling was already done upfront by IT specialists. In the case of a data lake, you need to be a data scientist to be able to analyze the various chunks of data and link them together to make sense of them. Table 1 summarizes key characteristics of data warehouses versus data lakes.

Table 1 (excerpt): Storage. Data warehouse: expensive for large data volumes. Data lake: designed for low-cost storage.
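The schema-on-write versus schema-on-read contrast described above can be sketched in a few lines. In the warehouse style, a record must match the predefined model before it is stored; in the lake style, raw blobs are stored untouched and interpreted only at query time. The record layouts and tag names below are hypothetical:

```python
import json

# Schema-on-write (warehouse style): records are validated against a
# predefined model before they are stored; anything off-model is rejected.
WAREHOUSE_SCHEMA = {"tag", "timestamp", "value"}

def load_into_warehouse(record, table):
    if set(record) != WAREHOUSE_SCHEMA:
        raise ValueError("record does not match the predefined data model")
    table.append(record)

# Schema-on-read (lake style): raw payloads go in untouched; structure is
# imposed only when someone queries, which is why a data scientist is
# usually needed to make sense of the chunks.
def query_lake(raw_blobs, tag):
    values = []
    for blob in raw_blobs:
        record = json.loads(blob)        # interpretation happens at read time
        if record.get("tag") == tag:     # relationship built on retrieval
            values.append(record.get("value"))
    return values

# Heterogeneous raw payloads, as they would land in a lake.
lake = ['{"tag": "PUMP1.PRESSURE", "value": 12.3}',
        '{"tag": "PUMP1.FLOW", "value": 104.0, "unit": "m3/h"}']
```

Note how the second lake record carries an extra `unit` field: the lake accepts it without complaint, while the warehouse loader would reject it until the data model is updated.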
The perfect world is very simple. You want to have access to all the data that is available (internal and external), query the data in any combination, run integrated analytics to find the missing pieces, and visualize the information you are looking for with the tool of your preference. However, the reality is often different. When combined with a real-time, time-series environment, the core concerns relate to the diversity of (legacy) data sources, network latency and reliability, data latency, time synchronization of data streams, and the context or relationships between data streams.
A hybrid model delivered by an ecosystem of suppliers will help to bridge the gap between the Perfect World and the technology constraints.
Depending on the company size and the equipment used for production, the variety in time-series data sources can be significant. There will be a legacy of control and automation systems, especially in older companies with various production locations, which may include systems from various brands, various types per brand, and various versions per type. Sending the raw data from these sources to the data lake is not even an option, as interfaces for these legacy sources do not exist. Also, the facility location can introduce significant data reliability concerns. Remote facilities connected via low-bandwidth connections, like satellites, need additional functionality to avoid data loss. Another important aspect is security. To assure the integrity and safe operation of your facility, the interface technology must be very secure. The following table shows the benefits of adding advanced real-time, time-series systems to the hybrid model to address the key concerns of the data lake technology.
The combination of data lake technology and time-series infrastructure will help to address the core concerns of the Perfect World. In this situation, the time-series data infrastructure will collect all the data from the field. The time-series data infrastructure will also assure the availability of data in the field for local viewing, processing, and reporting (edge computing), or for feeding data to (near) real-time optimization or advanced control. This edge computing assures the data and system availability needed to run and monitor the equipment in the production process itself by avoiding network availability and data latency issues.
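One way to picture the edge role described above is a store-and-forward buffer: samples are kept locally at the facility and drained to the central store only when the (possibly low-bandwidth) link is up, so an outage never loses data. This is a minimal sketch in plain Python, not a real product API; all names are invented:

```python
from collections import deque

class EdgeBuffer:
    """Store-and-forward buffer for an edge node: samples survive network
    outages locally and are forwarded to the central store when possible."""

    def __init__(self):
        self.pending = deque()

    def collect(self, sample):
        # The local copy is always written first, so local viewing and
        # monitoring keep working even when the link to the lake is down.
        self.pending.append(sample)

    def forward(self, link_up, send):
        """Drain buffered samples through `send` while the link is up.
        Returns the number of samples forwarded."""
        sent = 0
        while link_up and self.pending:
            send(self.pending.popleft())
            sent += 1
        return sent
```

A production implementation would add persistence to disk, bounded buffer sizes, and retry logic, but the ordering guarantee (collect locally first, forward opportunistically) is the essence of avoiding the data loss described for remote facilities.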
In order for data to drive business outcomes, it must be organized and accessible. Without structure, the data lake becomes a swamp. Individual data points have value for engineers very close to the production facility. Engineers usually know in detail how the facility is built and how to find each data point. However, as soon as reporting, monitoring, or analysis happens outside of the local environment, it becomes important to add structure, governance, and context to the huge number of available data points. Knowing each data point individually by name is no longer an option.
Example: Consider the contextual data that surrounds a single lube oil pump in a large facility. Each pump will have a parameter for pump name, power consumption, outlet pressure, outlet flow, outlet temperature, and filter differential pressure. Furthermore, anyone in the organization should know where the pump resides, where it sits in a process, and what may flow through the pump. Given the diversity of pumps and their various applications and processes, simply comparing all pumps is meaningless for analytics without this context.
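The pump example can be made concrete as an asset template: a fixed list of parameters every lube oil pump exposes, plus a per-pump mapping from those parameters to site-specific raw tag names. The sketch below is a simplified illustration in plain Python (real asset models, such as PI Asset Framework, are far richer), and every name and tag in it is invented:

```python
from dataclasses import dataclass, field

# The template names the parameters every lube oil pump exposes, so pumps
# can be compared at the equipment level instead of by raw tag names.
PUMP_TEMPLATE = ["power_consumption", "outlet_pressure", "outlet_flow",
                 "outlet_temp", "filter_diff_pressure"]

@dataclass
class PumpAsset:
    name: str
    location: str                                 # where in the plant/process it resides
    tag_map: dict = field(default_factory=dict)   # template parameter -> raw tag name

    def missing_parameters(self):
        """Template parameters not yet mapped to a data stream."""
        return [p for p in PUMP_TEMPLATE if p not in self.tag_map]

# A partially mapped pump: governance tooling can flag what is still missing.
p101 = PumpAsset("P-101", "Unit 3 / lube oil skid",
                 {"outlet_pressure": "U3.P101.PI001",
                  "outlet_flow": "U3.P101.FI002"})
```

Because every pump instance fills in the same template slots, a fleet-wide query like "outlet pressure of all lube oil pumps" no longer requires knowing thousands of individual tag names.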
In the case of OSIsoft's PI System, a toolkit is available to automate and significantly reduce the effort needed to build templates and map data streams into a structure. This toolkit is delivered by Element Analytics and reduces the time to value by 80%.
Data scientists need to consider not only the context of data, but also the data preparation. This is where most of the effort can be spent. Data scientists need to prepare the data by selecting the data set, cleansing the data, aligning data in time, and formatting it in the correct layout. This is the greatest challenge to data scientists looking to utilize time-series data for advanced analytics. Agile self-service data preparation tools like OSIsoft's Business Integrators, in combination with tools like those from Element Analytics, help to open up big data analytics for business users who are not IT specialists or data scientists. Companies like Cemex have shown that a traditional time-series data preparation that would take six months before it is ready for analytics can be reduced to four minutes of preparation time with the right tools. This agile and user-friendly way of working with the OSIsoft toolkit will reduce the time to value for business ideas significantly. In addition, less involvement is needed from IT specialists and data scientists, and it significantly reduces the Total Cost of Ownership (TCO) for the same business value.
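The four preparation steps named above (select, cleanse, align in time, format) can be sketched in plain Python. This is an illustration of the steps themselves, not the OSIsoft or Element Analytics tooling, and the tags, thresholds, and record layout are invented:

```python
def prepare(raw, tag, lo, hi):
    """Minimal version of the four preparation steps for one data stream."""
    # 1. Select: keep only the records for the tag of interest.
    selected = [r for r in raw if r["tag"] == tag]
    # 2. Cleanse: drop missing and physically implausible values.
    clean = [r for r in selected
             if r["value"] is not None and lo <= r["value"] <= hi]
    # 3. Align in time: order the samples by timestamp.
    clean.sort(key=lambda r: r["t"])
    # 4. Format: emit (timestamp, value) rows in an analytics-ready layout.
    return [(r["t"], r["value"]) for r in clean]

# Hypothetical raw records as they might arrive, unordered and noisy.
raw = [
    {"tag": "PUMP1.PRESSURE", "t": 2.0, "value": 12.0},
    {"tag": "PUMP1.FLOW",     "t": 1.0, "value": 100.0},
    {"tag": "PUMP1.PRESSURE", "t": 1.0, "value": 10.0},
    {"tag": "PUMP1.PRESSURE", "t": 3.0, "value": None},    # missing reading
    {"tag": "PUMP1.PRESSURE", "t": 4.0, "value": 999.0},   # sensor spike
]

rows = prepare(raw, "PUMP1.PRESSURE", lo=0.0, hi=50.0)
```

Done by hand across thousands of tags, these four steps are exactly where the months of effort go; self-service tools automate them per template rather than per tag.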
The perfect world for industrial data processing does not exist yet. Pushing all production and operational data in a raw format to a central big data store will result in a data swamp instead of a data lake. Only specialized data scientists will be able to make sense out of the data. In an industrial environment, pre-processing of all real-time data is essential. Bringing context to the data is a must to assure that business users can leverage the data to optimize operations.
This means that in an industrial environment the combination of a data lake with a real-time infrastructure will bring all the benefits of Big Data processing, like:

- Connectivity to the very diverse production and automation world
- Enterprise application development and reporting is enabled by having a One-Stop-Shop for data with a standardized data model for all assets
- Operational staff will have direct access to real-time data in a structured and agile way to optimize day-to-day operations
- Data Scientists will be able to find the big value items by combining all the data
Figure 6: An Enterprise Operations Infrastructure provides the foundation to ensure analytics-ready data for data initiatives.