Big Data Processing and Analytics Platform
Architecture for Process Industry Factories
Martin Sarnovsky *, Peter Bednar and Miroslav Smatana
Department of Cybernetics and Artificial Intelligence, Technical University Kosice, Letna 9,
04001 Kosice, Slovakia; peter.bednar@tuke.sk (P.B.); miroslav.smatana@tuke.sk (M.S.)
* Correspondence: martin.sarnovsky@tuke.sk
Abstract: This paper describes the architecture of a cross-sectorial Big Data platform for the process
industry domain. The main objective was to design a scalable analytical platform that will support
the collection, storage and processing of data from multiple industry domains. Such a platform
should be able to connect to the existing environment in the plant and use the data gathered to
build predictive functions to optimize the production processes. The analytical platform will contain
a development environment with which to build these functions, and a simulation environment
to evaluate the models. The platform will be shared among multiple sites from different industry
sectors. Cross-sectorial sharing will enable the transfer of knowledge across different domains.
During the development, we adopted a user-centered approach to gather requirements from different
stakeholders which were used to design architectural models from different viewpoints, from
contextual to deployment. The deployed architecture was tested in two process industry domains,
one from aluminium production and the other from the plastic molding industry.
Keywords: Big Data analytics; Big Data architecture; process industries; predictive data analysis
1. Introduction
Process industries represent a significant share of the European industry in terms of energy
consumption and environmental impact. In this area, optimization can lead to significant savings,
both economic and environmental. Predictive modeling can prove effective when applied to
the optimization of production processes. However, the application of these techniques is not
straightforward. Predictive models are built using the data obtained from production processes.
In many cases, process industries must invest in monitoring and data integration as well as in the
development and maintenance of the underlying infrastructure for data analytics. Many other obstacles
are also present, e.g., interoperability issues between software systems in production, difficulties in the
physical monitoring of the production parameters, problems with the real-time handling of the data, or
difficulties in defining relevant Key-Performance Indicators (KPIs) to support management. Therefore,
the deployment of such predictive functions in production with reasonable costs requires consolidation
of the available resources into shared cloud-based technologies. In the case of more flexible production
environments, even more far-reaching approaches are possible, such as the reinvention or redesign of
the production processes. However, this is not applicable to major, capital-intensive process industries.
In this case, the integration of innovations in the established production processes can be fundamental
in their transformation from resource-consuming production into the “circular” model.
The work presented in this paper was performed as part of the MONSOON (MOdel based control
framework for Site-wide OptimizatioN of data-intensive processes) project (https://www.spire2030.
eu/monsoon), focused on sectors of the aluminium and plastic industries. The main objective was to
pursue process optimization from the perspective of raw materials and energy reduction, through the
application of optimizations distributed across multiple production units within multiple distributed
production sites.
This paper aims to introduce a cross-sectorial, scalable, Big Data analytics platform for such
process industries, which is capable of storing and processing large amounts of data from multiple
sites. Besides cost reduction, sharing the platform for multiple sites from different sectors will enable
the transfer of the best practices and knowledge across different domains. The paper is organized
as follows: Section 2 gives an overview of the related research and projects. Section 3 gives an
outline of the design methodology and specification of the user requirements. Sections 4–7 present the
architectural views and describe the main components of the platform from different viewpoints.
2. Related Work
In the process industries, production processes must be accurately modelled to achieve
optimization goals set by the companies. Modeling using predictive models allows companies to
adapt to the changing requirements as the process models are implemented in an environment
that provides further knowledge through simulation and optimization. The traditional approach is
based on mathematical modeling complemented with statistical modeling or data-driven empirical
methods [1–3]. To reduce computational effort, various methods can be used [4–6]. Process analytical
technologies are widely used, and data analysis plays an important role in any modern process
industry, mainly in analyzing and controlling manufacturing processes and attributes of the materials
used. In this case, real-time optimization can be an effective approach to the improvement of processes
and operation. A model predictive control strategy, machine learning techniques, self-optimizing
and control mechanisms [7–10] are useful tools in the construction, adaptation and application of
real-time optimization methods applied in the process industries. Several projects are also aimed at
using machine learning and predictive functions in the process industries. FUDIPO [11] integrates
the machine learning functions into the process industries to provide improvements in energy and
resource efficiency. In OPTICO [12], a partial least squares model is used to predict the mean particle
size of polystyrene beads based on the collected data. Another example is ProPAT [13], where
production optimization is achieved by extracting information and knowledge using data analytic tools.
ProPAT also aims to develop novel and smart sensors for measuring process parameters, and integrate
them into a global control platform for data acquisition, processing and mining. The platform provides
self-learning and predictive capabilities aimed at reducing over-costs derived from the deviations from
the optimum process. CONSENS [14] introduces online sensing equipment and control of the key
product parameters in the production of high-value products with high-quality demands. The COCOP
project [15] will enable plant-wide monitoring and control by using the model-based, predictive,
coordinating optimization concept with integration of the plant’s automation systems.
From the perspective of existing platforms and technologies capable of handling big volumes
of data, numerous technologies exist that address various aspects of Big Data processing [16].
In areas of data ingestion, several technologies exist including Sqoop, Flume or Kafka. MQTT (MQ
Telemetry Transport or Message Queuing Telemetry Transport) is a widely used, simple and
lightweight messaging protocol [17]. The processing frameworks have been grouped according
to the state of the data they are designed to handle. Some systems handle data in batches, while
others process data in a continuous stream as it flows into the system. Batch processing involves
operating over a large, static dataset and returning the result later when the computation is complete.
The most popular is Hadoop, based on MapReduce—a programming model that allows the processing of
Big Data in a parallel and distributed way in a cluster. Stream processing computes over data as it enters
the system. This requires a different processing model than the batch paradigm. Current architectures
of Big Data processing platforms require technologies that can handle both batch and stream
workloads [18]. These frameworks simplify diverse processing requirements by allowing the same
or related components and application programming interfaces (APIs) to be used for both types of
data. Apache Spark [19] is a next generation batch processing framework with stream processing
capabilities. Built using many of the same principles as Hadoop's MapReduce engine, Spark focuses
primarily on speeding up batch processing workloads by offering full in-memory computation and
processing optimization. Apache Flink is an open source framework for distributed stream processing.
Flink uses the concept of streams for all applications. In Flink’s terms, a batch is a finite set of
streamed data. Regarding streaming, Flink processes data streams as true streams, i.e., data elements
are immediately pipelined through a streaming program as soon as they arrive, which distinguishes
Flink from Spark (which performs micro-batching). To enable machine learning from Big Data, several
tools are available. Spark MLlib is a machine learning library of Spark. The library comes with machine
learning algorithms for classification, regression, clustering, and collaborative filtering. FlinkML is
the machine learning library for Apache Flink. FlinkML has a scikit-learn-inspired pipeline mechanism
which allows data scientists to quickly build the kind of complex data analysis pipelines that appear
in their daily work. H2O (0xdata, Mountain View, CA, USA) is another open source
software for Big Data analysis. The machine learning library scales statistics, machine learning and
mathematics over Big Data. Currently popular frameworks are TensorFlow and Keras. TensorFlow
is a symbolic math library used for a wide array of machine learning and other applications, which
is platform agnostic and supports both CPUs and (multiple) GPUs. Keras is a machine learning
framework for Python built on top of abstractions of other back-end libraries (e.g., TensorFlow, Theano,
Deeplearning4j, CNTK, MXNET).
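To illustrate the level of abstraction these libraries provide, the following minimal sketch (our own example, unrelated to the cited projects) defines a small feed-forward regression network in Keras; the ten-sensor input and layer sizes are arbitrary assumptions.

```python
# Minimal Keras sketch: a small feed-forward network mapping ten sensor
# readings to one predicted process value. Sizes are illustrative only.
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(10,)),  # ten sensor readings in
    keras.layers.Dense(1),                                         # one predicted value out
])
model.compile(optimizer="adam", loss="mse")
# model.fit(x_train, y_train, epochs=10) would train on historical process data.
```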
The data gathered from the stakeholders were analyzed and led to the identification of the first set
of requirements. Based on those requirements, we created a draft of the architectural description,
on the basis of which the first prototype was created. The architectural views were derived from the
following viewpoints: context, information, functional, and deployment.
4. Context View
The context viewpoint describes interactions, relationships and dependencies between the system
and its environment which interact with the system, such as other systems, users, or developers.
The proposed architecture was designed in the context of the MONSOON, a SPIRE (Sustainable
Process Industry through Resource and Energy Efficiency) research project that aims to develop an
infrastructure in support of the process industries. Its main objective is to establish the data-driven
methodology which will support the identification and exploitation of optimization potentials by
applying model-based predictive controls in the production processes.
To validate and demonstrate the results, two real environments are used within the project: an
aluminium plant in France and a plastic factory in Portugal. We identified two main use cases in
each domain. For the aluminium sector, we focused on the production of the anodes (positive electrodes)
used in aluminium extraction by electrolysis. The first use case targeted predictive maintenance,
where the main objective was to anticipate the breakdowns and/or highlight equipment/process
deviations that impact the green anode final quality (e.g., anode density). The second case dealt with
the predictive anode quality control, where the goal was to identify bad anodes with a high level of
confidence and scrap them to avoid sending them to the electrolysis area.
For the plastic domain, the first case was from the area of production of coffee capsules, produced
in large quantities with little variation and relatively low quality specifications. In this type of
production, it is important to produce the correct diameter and height of the coffee capsules and to
make sure that the holes at the bottom of the capsules are formed properly. The second use case
covered the production of the parts used in the automotive industry, where methods of over-molding
metal inserts are applied. Based on the identified use cases, we divided the data analytics architecture
into two main components, as shown in Figure 1:
• Real Time Plant Operation Platform—used during runtime, can be used by employees working on
the shop floor. The component communicates with the existing heterogeneous systems deployed
on the production site including sensors or systems such as ERP (Enterprise resource planning),
SCADA (Supervisory control and data acquisition), MES (Manufacturing execution system) and
others. Relevant data from the production site are transferred to the Cross-Sectorial Data Lab.
• Cross-Sectorial Data Lab—a collaborative environment where high amounts of data from multiple
sites are collected and processed in a scalable way. Designed to be used by the data scientists
or the global process manager. It consists of Big Data storage and processing elements. It also
contains development tools for the creation of predictive functions, simulation tools for evaluation
of those functions in the testing environment and their deployment in the production site, and a
semantic framework which provides a common language between the data scientists and domain
experts (e.g., process managers).
The components are connected by the specified interfaces. The components and corresponding
interfaces are described in more detail in the Functional View section.
Figure 1. High level architecture overview. Abbreviations: ERP, Enterprise resource planning; SCADA, Supervisory control and data acquisition; MES, Manufacturing execution system; PLC, Programmable Logic Controller.
5. Information View

The Information View is formalized in the form of the process models and semantic meta-data models organized in the Semantic Modelling Framework. The main goal of the semantic modeling is to provide a common communication language between domain experts and stakeholders and data scientists. On the one hand, data scientists need a deep knowledge of the business objectives and modelled phenomena acquired from the stakeholders and domain experts. On the other hand, stakeholders and domain experts need to interpret the results of the data analysis.

The Semantic Modelling Framework combines concepts from the IEC/ISO standards for the enterprise-control system integration with concepts from the data mining methodologies and interoperability standards such as Cross-Industry Standard Process for Data Mining (CRISP-DM) [26,27], Predictive Model Markup Language (PMML) [28], and Portable Format for Analytics (PFA) [29].

The overall structure of the Semantic Modelling Framework is presented in Figure 2 and covers the following main concepts: Production Processes and Process Segments, Equipment and Physical Assets, Key-Performance Indicators, Data Elements and Predictive Functions.
Figure 2. The main concepts of the Semantic Modelling Framework. Abbreviation: KPI, Key-Performance Indicator.
5.3. Key-Performance Indicators

The concepts of KPIs represent metrics designed to visualize, assess and manage the performance or impact of specific operations within the process industries. They are linked to the Process Segments, or specified for the overall Production Process. The main properties of the KPI metrics (e.g., type and quantity) are inherited from the Logical Data Elements (see the description in the following subsection). The KPIs can be grouped into KPI categories according to the classification in ISO 22400 [30].

Besides this classification, causal relations between KPIs can be expressed by the KPI Functions, which transform one KPI to another (e.g., material or energy savings can be converted into KPIs for the environmental impact, such as the level of emissions). The input of the KPI function can also be Evaluation Metrics, which estimate the performance of the predictive functions (see the description of the Predictive Functions in the subsequent subsection). Using the composition of KPI Functions, it is possible to infer the impact of the deployment of a predictive function on the performance of the production process, where the performance can be evaluated from various perspectives such as material/energy consumption, product quality, environmental impact, etc.
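To make the composition of KPI Functions concrete, the following minimal Python sketch (our illustration, not part of the platform implementation) chains two hypothetical KPI Functions, so that a predicted reduction in scrapped anodes is propagated first into energy savings and then into avoided emissions. All conversion factors are made-up example values.

```python
# Minimal illustration of KPI Functions as composable transformations.
# All conversion factors below are hypothetical example values.
from typing import Callable

KPIFunction = Callable[[float], float]

def scrap_reduction_to_energy(saved_anodes: float) -> float:
    """Hypothetical: each anode not scrapped saves 100 kWh of reprocessing."""
    return saved_anodes * 100.0

def energy_savings_to_emissions(kwh_saved: float) -> float:
    """Convert saved energy (kWh) into avoided CO2 emissions (kg),
    assuming a hypothetical grid factor of 0.5 kg CO2 per kWh."""
    return kwh_saved * 0.5

def compose(f: KPIFunction, g: KPIFunction) -> KPIFunction:
    """Chain two KPI Functions: the output KPI of g feeds f."""
    return lambda x: f(g(x))

# Infer the environmental impact of a predictive function that is
# expected to avoid scrapping 20 anodes per day.
impact = compose(energy_savings_to_emissions, scrap_reduction_to_energy)
print(impact(20.0))  # -> 1000.0 kg CO2 avoided per day
```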
6. Functional View
There are two main components of the architecture: the Data Lab as the platform for data storage
and processing and the Plant Operation Platform deployed on-site and providing a connection to the
production environment.
The Data Lab platform is connected to the Plant Operation Platform using the following interfaces:
• Data Replication Service—This interface allows the uploading of a large batch of historical data
collected on the site to the cloud Data Lab storage.
• Messaging Service—This interface allows real-time asynchronous communication between the Data
Lab and Operational platform. It is optimized for frequently updated data (e.g., from sensors)
with a relatively small size of the payload for each update.
• Functions Repository—This interface allows the export of predictive functions built on the Data Lab
platform and their deployment on the Plant Operation Platform for scoring of the operational data.
All interfaces are provided by the Data Lab platform with Plant Operation Platform components
acting as clients.
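As an illustration of a Plant Operation Platform component acting as a client of the Messaging Service, the sketch below publishes a single sensor reading over MQTT using the paho-mqtt library (MQTT was the protocol used in the pilots, see the Deployment View). The broker address, topic name and payload schema are illustrative assumptions, not the project's actual interface definitions.

```python
# Hypothetical Plant Operation Platform client publishing a sensor reading
# to the Data Lab Messaging Service over MQTT. Host, topic and payload
# schema are placeholders, not the project's actual API.
import json
import time
import paho.mqtt.client as mqtt

client = mqtt.Client()
client.connect("datalab.example.org", 1883)  # placeholder broker address

reading = {"sensor": "press_42/temperature", "value": 186.4, "ts": time.time()}
client.publish("site-a/operational-data", json.dumps(reading), qos=1)
client.disconnect()
```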
Figure 3. Plant Operational Platform architecture. Abbreviations: IoT, Internet of Things; RDBMS, relational database management system; NoSQL, Non Structured Query Language; API, application programming interface.
The Operation Data Visualization Framework provides a web user interface where operational managers can configure various real-time visualizations of operational data and monitor the deployed predictive functions. The visualized data are integrated by the Virtual Process Industries Resource Adapter and can include operational data from the plant environment or predictions from the predictive functions executed in the Run-time Container. The Virtual Industries Resource Adapter streams the visualized data asynchronously through the REST web service interface provided by the Operational Data Visualization Framework. The visualized data can be enhanced with new trend indicators using various trend analysis methods. The Operational Data Visualization Framework also provides an API for the implementation of new trend indicators.
6.2. Cross-Sectorial Data Lab Platform

The architecture of the Cross-Sectorial Data Lab platform is shown in Figure 4 and consists of the following main components: Big Data Storage and Analytics Platform; Development Tools; Semantic Modelling Tools and Simulation and Resource Optimization Framework. These components are described in more detail in the following subsections.
Figure 4. The architecture of the Cross-Sectorial Data Lab platform.
6.2.1. Big Data Storage and Analytics Platform

The Big Data Storage and Analytics Platform provides resources and functionalities for storage as well as for batch and real-time processing of the Big Data. It provides the main integration interfaces between the site Operational Platform and the cloud Data Lab platform and the programming interfaces for the implementation of the data mining processes. The internal structure of the Big Data Storage and Analytics Platform is given in Figure 5.
Data are primarily stored in the Distributed File System, which is responsible for the distribution and replication of large datasets across the multiple servers (data nodes). A unified access to the structured data is provided by the Distributed Database using the standard SQL (Structured Query Language) interface. The main component responsible for data processing is the Distributed Data Processing Framework, which provides a high-level API for the implementation of the data pre-processing tasks and for the building and validation of the predictive functions. Predictive functions are stored in the Functions Repository, where they are available for production deployment or for the simulations and overall optimization of the production processes. The rest of the components (Messaging Service and Data Replication Service) provide data communication interfaces, and connect the Operational platform to the Data Lab platform.
Figure 5. The internal architecture of the Big Data Storage and Analytics Platform.
The Big Data Storage and Analytics Platform consists of the following sub-components:
• Distributed File System—provides a reliable, scalable file system with similar interfaces and semantics to access data as local file systems.
• Distributed Database—provides a structured view of the data stored in the Data Lab platform using the standard SQL language, and supports standard RDBMS programming interfaces such as JDBC for Java or ODBC for .Net platforms.
• Distributed Data Processing Framework—allows the execution of applications in multiple nodes in order to retrieve, classify or transform the arriving data. The framework provides Data Analytics APIs for two main paradigms for processing large datasets: API for parallel computation and API for distributed computation. A sketch of a typical job is shown after this list.
• Functions Repository—provides storage for predictive functions together with all settings required for the deployment of functions.
• Messaging Service—implements an interface for real-time communication between the Data Lab and Operation platforms. It provides a publish-subscribe messaging system for asynchronous real-time two-way communication, which allows the decoupling of data providers and consumers.
• Data Replication Service—provides an interface for the uploading of the historical batch data between the Data Lab and Operation platform.
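For illustration, the following PySpark sketch shows the kind of job the Distributed Data Processing Framework is intended to execute (the reference implementation used Apache Spark, see the Deployment View): reading replicated operational data from the Distributed File System and computing per-sensor statistics. The paths and column names are hypothetical.

```python
# Illustrative PySpark batch job: read replicated operational data from the
# Distributed File System and compute per-sensor statistics.
# Paths, schema and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sensor-aggregates").getOrCreate()

readings = spark.read.csv("hdfs:///datalab/site-a/operational/*.csv",
                          header=True, inferSchema=True)

stats = (readings
         .groupBy("sensor_id")
         .agg(F.avg("value").alias("mean"),
              F.stddev("value").alias("std"),
              F.count("value").alias("n")))

stats.write.mode("overwrite").parquet("hdfs:///datalab/site-a/aggregates")
```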
6.2.2. Development Tools

The Development Tools provide the main collaborative and interactive interface for data engineers, data analysts and data scientists to execute and interact with the data processing workflows running on the Data Lab platform. Using the provided interface, data scientists can organize, execute and share data and code, and visualize results without referring to the internal details of the underlying Data Lab cluster. The interface is integrated into the form of analytical “notebooks” where different parts of the analysis are logically grouped and presented in one document. These notebooks consist of code editors for data processing scripts and SQL queries, and interactive tabular or graphical presentations of the processed data.
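As an illustration of such a notebook paragraph (the reference implementation adopted Apache Zeppelin, see the Deployment View), the sketch below registers a dataset as a temporary view and queries it with SQL; the path, view and column names, and the threshold are hypothetical.

```python
# Hypothetical notebook paragraph: load aggregated data, register it as a
# temporary view and query it with Spark SQL. Names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("notebook-analysis").getOrCreate()
readings = spark.read.parquet("hdfs:///datalab/site-a/aggregates")
readings.createOrReplaceTempView("aggregates")

spark.sql("""
    SELECT sensor_id, mean, std
    FROM aggregates
    WHERE std > 10.0   -- hypothetical threshold for unstable sensors
    ORDER BY std DESC
""").show()
```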
6.2.3. Semantic Modelling Tools
The Semantic Modelling Tools provide a collaborative user interface for the creation and sharing
of semantic models specified in the Semantic Modelling Framework (see the description in the section
Information View). Additionally, the Semantic Modelling Tools provide a web service interface for
7. Deployment View
Nowadays, a large number of various technologies for Big Data processing are emerging,
often with overlapping functionalities and non-functional properties. For the system architects,
the selection of implementation technology is a challenging task which requires the consideration of
many implementation details and compatibility constraints. The Deployment View provides a consistent
mapping between the existing and emerging technologies and the functional components specified in
the Functional View. The mapping for the reference implementation of the proposed architecture is summarized
in Table 1 for both architecture platforms (Plant Operation platform and Cross-Sectorial Data Lab
platform). All referenced technologies are available under an open source license without any usage
restrictions. Alternatively, the proposed technological stack can be implemented using technologies
from the main Big Data platform providers such as Cloudera, Hortonworks, IBM or Microsoft
HD Insight.
In this section, we present the initial version of the platform. Our reference implementation
used Apache Nifi (scalable framework for data routing, transformation) as the main technology for
data integration in the Plant Operation Platform. Nifi was used for the collection of heterogeneous
data from various process industry sites and to store the processed data in the Data Lab platform.
The Run-time Container was based on the Hadrian scoring engine for the Portable Format for
Analytics (PFA, an emerging interchange format for data-mining models). Operational data were
visualized using Grafana (https://grafana.com/), a visualization framework for analytics and
monitoring. The core components of the Cross-Sectorial Data Lab platform were based on the standard
Apache Hadoop (hadoop.apache.org) infrastructure, which consists of the distributed file system
(Hadoop Distributed File System, HDFS) and resource manager (Yet Another Resource Negotiator,
YARN). As the main framework for the implementation of the distributed data processing, Apache
Spark (https://spark.apache.org) was adopted, which supports both batch and stream processing.
Apache Spark also provided support for SQL data analytics. Support for parallel GPU/CPU
computation was based on the combination of DL4J (https://deeplearning4j.org), an open-source,
distributed deep-learning library, and Keras, a high-level neural networks API (https://keras.io).
The platforms were integrated using the Apache Kafka messaging system for real-time communication
and native HDFS web service interface for batch data updates. Access to the Cross-Sectorial Data
Lab platform was secured by the Apache Knox (https://knox.apache.org) security gateway. The cluster was
managed using Apache Ambari (https://ambari.apache.org) management software and, to support the
development of the predictive functions, we adopted Apache Zeppelin (https://zeppelin.apache.org/)
as a basis for the Development tools component.
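To give a sense of the PFA format scored by the Run-time Container, the sketch below loads the format's canonical "add one" document. Hadrian itself is a JVM engine; purely for illustration we use Titus, the Python PFA engine from the same project family, which is an assumption on our part rather than part of the described deployment.

```python
# The PFA format's canonical "add one" document, scored for illustration
# with Titus, the Python PFA engine related to Hadrian. This is a sketch,
# not the platform's actual Run-time Container code.
import json
from titus.genpy import PFAEngine

pfa_document = {
    "input": "double",                 # one incoming sensor value
    "output": "double",                # the scored result
    "action": [{"+": ["input", 1]}],   # trivial model: add one
}

engine, = PFAEngine.fromJson(json.dumps(pfa_document))
print(engine.action(3.0))  # -> 4.0
```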
Figure 6 presents the Deployment View with the main types of the nodes. In order to simplify
the on-site deployment, all components of the Plant Operation Platform were installed in one Site
container server connected to the site infrastructure. This container can be deployed in the cluster for
scalability and reliability. The Cross-Sectorial Data Lab platform was deployed as the cloud cluster,
which consisted of three types of nodes. The security Gateway was the only server connected to
the Internet, and isolated the Data Lab components, which were interconnected by the private network.
The Master ran the main services for the management, monitoring, configuration and provisioning
of the Data Lab cluster such as the Apache Ambari server, HDFS Name Node and YARN resource
manager. Data were stored and processed on Worker nodes, which ran execution services such as HDFS
Data Nodes and Spark workers.
The presented initial version of the platform was tested and evaluated in both considered domains.
The main objective was to deploy the Plant Operation platforms on-site to collect and transfer the data
to the Cross-Sectorial Data Lab. The main goal of the evaluation was to test the infrastructure for the
transfer of both types of data: batches of data and real-time data streams. The main idea was to provide
a proof of concept of how the platform would perform when handling both types of data obtained from
real on-site plant operations.
Figure 6. Deployment view with the main types of nodes.
Data were integrated using Apache Nifi and transferred using the MQTT (MQ Telemetry Transport or Message Queuing Telemetry Transport) (http://mqtt.org) protocol to the Data Lab, where they were stored in an HDFS filesystem. In the aluminium domain, Historian Pi, a real-time data historian application, was used on-site, and the Pi connector created one file per indicator (sensor) every 15 min (the scheduled period is configurable) in a dedicated folder in the local filesystem. Newly created files were then sent using the Data Replication Service to the Data Lab storage component. In this case, the platform transferred and integrated approximately 10,000 sensor values from the production site every 15 min. Data were collected and replicated in the storage component. In the plastic domain, the data files were streamed into the Data Lab via the Messaging Service as soon as they were created, received by a subscriber, and converted to the uniform format. The data were then stored or pushed to the visualization component for real-time analysis. For the real-time data, the platform was used to stream more than 100 sensor values per second to the storage component.

The initial version of the platform was successfully deployed and put into operation. Data were being ingested and collected in the filesystem. The next steps involve the expansion of the platform with Development Tools, which will provide data scientists with the environment to model, define, create, and evaluate the predictive models in the Data Lab on top of the gathered historical data. Validated models could then be packaged and deployed to the Plant Operation Platform for production use.
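The plastic-domain ingestion path described above can be sketched as follows: a subscriber receives messages from the Messaging Service, converts them to a uniform record, and hands them over for storage or visualization. The broker address, topic and record schema are illustrative assumptions.

```python
# Schematic MQTT subscriber on the Data Lab side: receive a message,
# convert it to a uniform record and hand it to storage/visualization.
# Broker, topic and field names are illustrative placeholders.
import json
import paho.mqtt.client as mqtt

def to_uniform_format(payload: bytes) -> dict:
    """Convert a site-specific message into a uniform record
    (hypothetical field names)."""
    raw = json.loads(payload)
    return {"sensor": raw["sensor"], "value": float(raw["value"]), "ts": raw["ts"]}

def on_message(client, userdata, msg):
    record = to_uniform_format(msg.payload)
    print("store/visualize:", record)  # stand-in for HDFS write / Grafana push

client = mqtt.Client()
client.on_message = on_message
client.connect("datalab.example.org", 1883)  # placeholder broker address
client.subscribe("site-b/operational-data", qos=1)
client.loop_forever()
```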
8. Conclusions

In this paper, we described the complete specification of the architecture for Big Data processing in the process industries. A user-centered approach to gathering user scenarios and requirements from different stakeholders was adopted for the specification of the architecture. Based on the user scenarios and requirements, we specified the information and functional views of the architecture. The Information View was formalized as a semantic model, which defined concepts for the modeling of the production processes, data elements, predictive functions, and inference of the Key-Performance Indicators. The Functional View specifies the decomposition of the architecture into modules and components, and provides a detailed functional description for each specified component and interface. From the functional point of view, the main design concept is the division of the architecture into two platforms: the Plant Operation platform, deployed on-site and providing connection to the production environment, and the cloud Cross-Sectorial Data Lab platform for data storage and processing. The concept of the Data Lab platform reduces the costs associated with the implementation of the data analytics methods in the production processes by consolidating the resources for multiple sites, and enables the sharing and transfer of knowledge between different industry sectors. The description of the architecture was completed with the Deployment View, which maps the functional components to the technologies for Big Data processing. The main objective of the reference implementation was to provide a consistent mapping of technologies with minimal dependencies, covering all functional requirements. The reference architecture was implemented and tested in two domains: the production of aluminium and the plastic molding industry. In the future, we will further focus on the standardization of the semantic models and programming application interfaces.
Acknowledgments: The work presented in the paper is supported by H2020-SPIRE-2016 Project MONSOON
“MOdel based control framework for Site-wide OptimizatioN of data-intensive processes,” funded by the EC
within the Contract No. 723650 and by the Slovak VEGA research grant 1/0493/16 and Slovak APVV research
grant APVV-16-0213.
Author Contributions: Peter Bednar designed the architecture of the platform; Peter Bednar, Miroslav Smatana
and Martin Sarnovsky implemented and deployed the initial version of the platform; Martin Sarnovsky and
Peter Bednar wrote the paper.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Xie, X.; Shi, H. Dynamic Multimode Process Modeling and Monitoring Using Adaptive Gaussian Mixture
Models. Ind. Eng. Chem. Res. 2012, 51, 5497–5505. [CrossRef]
2. Matzopoulos, M. Dynamic Process Modeling: Combining Models and Experimental Data to Solve Industrial
Problems. In Process Systems Engineering; Wiley-VCH Verlag GmbH & Co. KGaA: Weinheim, Germany, 2011;
pp. 1–33, ISBN 9783527631339.
3. Cameron, I.; Gani, R.; Cameron, I.; Gani, R. Chapter 6—Steady-State Process Modelling. In Product and
Process Modelling; Elsevier: Amsterdam, The Netherlands, 2011; pp. 125–156, ISBN 9780444531612.
4. Kondov, I.; Sutmann, G. Multiscale Modelling Methods for Applications in Materials Science; IAS Series;
Forschungszentrum Jülich, Zentralbibliothek: Jülich, Germany, 2013; ISBN 978-3-89336-899-0.
5. Weinan, E. Principles of Multiscale Modeling; Cambridge University Press: Cambridge, UK, 2011;
ISBN 9781107096547.
6. Sadowski, T.; Trovalusci, P. (Eds.) Multiscale Modeling of Complex Materials; CISM International Centre for
Mechanical Sciences; Springer: Vienna, Austria, 2014; Volume 556, ISBN 978-3-7091-1811-5.
7. Del Rio-Chanona, E.A.; Zhang, D.; Vassiliadis, V.S. Model-based real-time optimisation of a fed-batch
cyanobacterial hydrogen production process using economic model predictive control strategy.
Chem. Eng. Sci. 2016, 142, 289–298. [CrossRef]
8. Cheng, Z.; Liu, X. Optimal online soft sensor for product quality monitoring in propylene polymerization
process. Neurocomputing 2015, 149, 1216–1224. [CrossRef]
9. Zhong, Y.; Yang, C.; Yuchen, C.; Xuhua, S. Process real-time optimization using Clonalg algorithm.
In Proceedings of the 27th Chinese Control and Decision Conference (2015 CCDC), Qingdao, China,
23–25 May 2015; pp. 743–748.
10. Moghaddam, M.; Nof, S.Y. Real-time optimization and control mechanisms for collaborative demand and
capacity sharing. Int. J. Prod. Econ. 2016, 171, 495–506. [CrossRef]
11. Moksadur, R.; Avelin, A.; Kyprianidis, K.; Dahlquist, E. An Approach for Feedforward Model Predictive
Control for Pulp and Paper Applications: Challenges and the Way Forward. In Proceedings of the PaperCon
2017, Minneapolis, MN, USA, 23–26 April 2017; Volume 10.
12. Optico Project. Available online: www.opticoproject.eu (accessed on 29 May 2017).
13. PROPAT-Integrated Process Control. Available online: http://pro-pat.eu/ (accessed on 29 May 2017).
14. Consens Project. Available online: http://www.consens-spire.eu/ (accessed on 29 May 2017).
15. COCOP SPIRE H2020 Project. Available online: http://www.cocop-spire.eu/ (accessed on 29 May 2017).
16. Lv, Z.; Song, H.; Basanta-Val, P.; Steed, A.; Jo, M. Next-Generation Big Data Analytics: State of the Art,
Challenges, and Future Research Topics. IEEE Trans. Ind. Inform. 2017, 13, 1891–1899. [CrossRef]
17. Al-Fuqaha, A.; Guizani, M.; Mohammadi, M.; Aledhari, M.; Ayyash, M. Internet of Things: A Survey
on Enabling Technologies, Protocols, and Applications. IEEE Commun. Surv. Tutor. 2015, 17, 2347–2376.
[CrossRef]
18. Basanta-Val, P.; Audsley, N.C.; Wellings, A.J.; Gray, I.; Fernandez-Garcia, N. Architecting Time-Critical
Big-Data Systems. IEEE Trans. Big Data 2016, 2, 310–324. [CrossRef]
19. Wang, K.; Fu, J.; Wang, K. SPARK—A Big Data Processing Platform for Machine Learning. In Proceedings of
the 2016 International Conference on Industrial Informatics—Computing Technology, Intelligent Technology,
Industrial Information Integration, Wuhan, China, 3–4 December 2016; pp. 48–51.
20. Norman, D.A.; Draper, S.W. User Centered System Design: New Perspectives on Human-Computer Interaction;
L. Erlbaum Associates: Mahwah, NJ, USA, 1986; ISBN 0898597811.
21. Mirnig, A.G.; Meschtscherjakov, A.; Wurhofer, D.; Meneweger, T.; Tscheligi, M. A Formal Analysis of the
ISO 9241-210 Definition of User Experience. In Proceedings of the 33rd Annual ACM Conference Extended
Abstracts on Human Factors in Computing Systems—CHI EA ’15, Seoul, Korea, 18–23 April 2015; ACM Press:
New York, NY, USA, 2015; pp. 437–450.
22. Clark, P.G.; Lobsitz, R.M.; Shields, J.D. Documenting the evolution of an information system. In Proceedings
of the IEEE National Aerospace and Electronics Conference, Dayton, OH, USA, 22–26 May 1989;
pp. 1819–1826.
23. International Organization for Standardization; International Electrotechnical Commission; Institute of
Electrical and Electronics Engineers; IEEE-SA Standards Board. Systems and Software Engineering: Architecture
Description; ISO: Geneva, Switzerland, 2011; ISBN 9780738171425.
24. Rozanski, N.; Woods, E. Software Systems Architecture: Working with Stakeholders Using Viewpoints and
Perspectives; Addison-Wesley: Boston, MA, USA, 2005; ISBN 0321112296.
25. Dzida, W. Developing Scenario-Based Requirements and Testing them for Minimum Quality. In Proceedings
of the HCI International (the 8th International Conference on Human-Computer Interaction) on
Human-Computer Interaction: Ergonomics and User Interfaces-Volume I-Volume I, Munich, Germany,
22–26 August 1999; Bullinge, H.-J., Ziegler, J., Eds.; Lawrence Erlbaum: Hillsdale, NJ, USA, 1999;
pp. 1205–1208.
26. Chapman, P.; Clinton, J.; Kerber, R.; Khabaza, T.; Reinartz, T.; Shearer, C.; Wirth, R. Crisp-Dm 1.0.
CRISP-DM Consort. 2000, 76. [CrossRef]
27. Shearer, C.; Watson, H.J.; Grecich, D.G.; Moss, L.; Adelman, S.; Hammer, K.; Herdlein, S. The CRISP-DM
model: The New Blueprint for Data Mining. J. Data Warehous. 2000, 5, 13–22.
28. Pechter, R. What’s PMML and what’s new in PMML 4.0? ACM SIGKDD Explor. Newsl. 2009, 11, 19–25.
[CrossRef]
29. Pivarski, J.; Bennett, C.; Grossman, R.L. Deploying Analytics with the Portable Format for Analytics (PFA).
In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining—KDD ’16, San Francisco, CA, USA, 13–17 August 2016; ACM Press: New York, NY, USA, 2016;
pp. 579–588.
30. ISO 22400-1:2014—Automation Systems and Integration—Key Performance Indicators (KPIs) for
Manufacturing Operations Management—Part 1: Overview, Concepts and Terminology. Available online:
https://www.iso.org/standard/56847.html (accessed on 29 May 2017).
© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).