Ares(2020)476621 - 25/01/2020
Project co-funded under Horizon 2020 Research and Innovation EU programme, grant agreement no. 732194
Table of contents
ABSTRACT 4
EXECUTIVE SUMMARY 5
INTRODUCTION 6
Overview and structure of the Document 6
Tracked changes 6
D4.2 Page 2 of 61
Data acquisition framework
CSDA-2: Upload/Update a dataset into CKAN (from a URL + decompress) 31
CSDA-3: JSON-based transformations + Update a dataset into CKAN 33
CSDA-4: External-script transformations + Update a dataset into CKAN 36
Application of static core templates: Integrated acquisition and FIWARE/GeoJSON
transformation of Municipality of Trento datasets 39
Deployment and release of the framework 42
Deployment 42
Release 42
Static Requirements Validation 42
CONCLUSIONS 55
REFERENCES 56
1030 58
1031 58
ABSTRACT
This document presents the main building blocks of the QROWD Data Acquisition
Framework. The framework is based on the definition of several data flows created by
combining Apache NiFi templates generated in the scope of QROWD. The
document presents how new datasets can be acquired into CKAN (static datasets)
or uploaded to the Orion Context Broker (dynamic or streaming data) for further use.
EXECUTIVE SUMMARY
This document is the second deliverable of QROWD WP4 and presents the main
building blocks of the QROWD Data Acquisition Framework for static and dynamic
datasets. It is intended mainly for developers of data-enabled applications that would
like to make use of a data portal (CKAN as selected implementation) and a broker
(the FIWARE Orion Context Broker as selected implementation) as data repositories.
The reported framework is set in the context of the QROWD high-level architecture
described in D8.1.
The document explains the rationale behind the data acquisition framework. The
framework is based on the definition of data flows created by combining Apache
NiFi templates generated in the scope of QROWD. The templates serve
different purposes, ranging from uploading new datasets to CKAN (static datasets)
to pushing the latest status of dynamic or streaming data to the Orion Context
Broker (dynamic or streaming data). The combination of those templates allows
developers to easily implement new data flows by using the user interface provided
by NiFi with little or no programming effort.
The deliverable also explains the methodology and best practices to create data
flows and perform simple data transformations in NiFi. The document shows how
this is done to generate datasets in multiple formats and flavours and upload them to
CKAN.
1. INTRODUCTION
1.1. Overview and structure of the document
This document is the second deliverable of WP4. It reports on the Data
Acquisition Framework developed in the scope of QROWD. In particular, the
document reports on the results of task 4.2 (Data acquisition), and as such
provides the methods and software artefacts needed to fulfil the requirements for data
acquisition gathered from the pilot use cases (WP1 and WP2). This involves the
development and deployment of the data acquisition functionalities originally
specified in deliverable D4.1 [D4.1] and polished after several iterations with the
project partners.
This document was revised in November 2019 to keep the catalog up to date with
the latest information available about the datasets at the end of the project.
Besides cosmetic changes, the main updates are the following:
● Add and update the new NiFi pipelines in "ANNEX 1: Deployed dataflows"
This section provides the context of the data acquisition framework in relation to the
QROWD architecture.
This deliverable presents the data acquisition framework for the collection of
QROWD-relevant data. According to the definition given in deliverable D4.1, the
acquisition framework should allow the extraction of data from distributed and
heterogeneous sources and make it available for further usage in the project. A full
identification and assessment of the datasets to be acquired by QROWD was done
in D4.1, and more specifically in the so-called Data Catalog, which was the result
of D4.1: QROWD Live Data Catalog (LDC).
However, this document describes just the main data acquisition flows in QROWD.
These flows are related to the main static data collection, described in Section 3 of
this document, and to the dynamic data collection, described in Section 4. Other
QROWD work packages collect external data for their own internal
purposes, such as:
● Real-time or mobile phone sensor data acquisition, described in D2.4 “iLOG”.
● News/Social Media streaming acquisition mechanisms, fully described in D4.4
“Crowdsourced multilingual data harvesting and extraction framework”.
Figure 2 shows a general view of the acquisition framework in the context of static
and dynamic data acquisition. As depicted in the figure, other data flows, such as
those coming from city sensors, might be supported by the framework with the inclusion of
FIWARE IoT Agents.
For the static and dynamic data acquisition framework presented in this deliverable,
the following mechanisms are made available to assist the QROWD data value chain:
● Set of acquisition and transformation components based on NiFi [Apache
NiFi], CKAN [CKAN], and FIWARE Context Broker [FIWARE Context
The presented framework will provide users and processes with data acquisition
mechanisms for collecting data of different nature (static and dynamic/streaming)
and from different sources (e.g. services and repositories external to QROWD, or
internal processes) and make them available to other QROWD processes and other
actors.
In particular, the static data framework aims to provide users and processes with
acquisition facilities such as:
● Uploading new datasets to the central data repository, CKAN. These datasets
may come from different sources of information:
○ Dataset coming from the Municipality of Trento services
○ VCE tool (See D3.2 Crowdsourcing services).
○ QROWD Fusion and Interlinking process from QROWD WP5
○ Available datasets such as: OpenStreetMap [OpenStreetMap]
○ Other sources
● Uploading new versions of existing datasets (versioning/backup)
● Transforming and uploading different formats of existing datasets. Some of
the formats managed in QROWD are:
○ FIWARE format for data integration/homogenization purposes.
FIWARE data models1 [FIWARE Data Models] are a set of
harmonized data models for smart city applications.
○ GeoJSON [H. Butler et al, 2016] for visualization purposes
○ RDF2 [World Wide Web Consortium, 2014] format for analytical
purposes
On the other hand, the dynamic data acquisition framework uses the Orion Context
Broker (FIWARE) to persist contextual information about the state of several assets
of a city (e.g. parking lots). Consequently, the dynamic acquisition framework is
intended for processes that need to receive information about the real-time
status of the city and act on it, such as the visualization of real-time data
in the Municipality and Citizen dashboards.
1 https://www.fiware.org/developers/data-models/
2 https://www.w3.org/RDF/
This way of handling the data gives NiFi, out of the box, some useful features that
include:
● It is highly scalable: it can be clustered and scaled horizontally, so each
flowfile can be processed in a different system.3
● It supports back pressure4. The amount of data that is queued
because the target system cannot handle it is configurable, and the
source system can be told to hold the ingestion.
● Provenance5 of data is stored, so you can know which processors modified
the data.
In a nutshell, NiFi moves the data in the FlowFile from processor to processor
through the connections that link them. It can also route FlowFiles
through different paths depending on their attributes or on the data itself.
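The idea of FlowFiles moving between processors and being routed by attribute can be illustrated with a toy model. This is only a sketch in plain Python, not NiFi code: the `FlowFile` class, the `tag_mime_type` processor and the route names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FlowFile:
    """Toy stand-in for a NiFi FlowFile: content plus key/value attributes."""
    content: bytes
    attributes: Dict[str, str] = field(default_factory=dict)


def tag_mime_type(flowfile: FlowFile) -> FlowFile:
    # Hypothetical processor: derive an attribute from the filename.
    name = flowfile.attributes.get("filename", "")
    is_json = name.endswith(".json")
    flowfile.attributes["mime.type"] = (
        "application/json" if is_json else "application/octet-stream"
    )
    return flowfile


def route_on_attribute(flowfile: FlowFile) -> str:
    # Choose the outgoing connection from an attribute value.
    if flowfile.attributes.get("mime.type") == "application/json":
        return "json"
    return "other"


def run_flow(flowfile: FlowFile,
             processors: List[Callable[[FlowFile], FlowFile]],
             routes: Dict[str, List[FlowFile]]) -> None:
    # Pass the flowfile through each processor in order, then route it.
    for processor in processors:
        flowfile = processor(flowfile)
    routes[route_on_attribute(flowfile)].append(flowfile)


routes: Dict[str, List[FlowFile]] = {"json": [], "other": []}
run_flow(FlowFile(b"{}", {"filename": "bike_racks.json"}), [tag_mime_type], routes)
run_flow(FlowFile(b"PK", {"filename": "archive.zip"}), [tag_mime_type], routes)
```

In NiFi itself the same behaviour is configured graphically, with connections between processors taking the place of the `routes` dictionary.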
On top of this engine, Apache NiFi lets you create your own custom
processors when a functionality is not covered by the
standard set of processors bundled with it. Anyone can implement
the required interfaces and create a processor that, for example, reads or writes
data to a CKAN system from within the NiFi platform. The
developer does not have to code most of the functionalities that NiFi offers,
because they are added seamlessly to the custom processor.
2.4.2. OASC
2.4.2.1. Context Broker
3 http://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#clustering
4 https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#back-pressure
5 https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#data_provenance
● Entities. The virtual representation of an object of the real life. The entities
have an identifier and a type.
● Attributes. Information about the features of an entity; attributes can also
contain metadata.
● Domain Attributes. A way to create sets of elements and group the attributes
with a similar logic.
● Context elements. “The data structure used for exchanging information about
entities”.
Context Broker is used in projects where you need to develop and deploy with data,
as an intermediate component that connects data producers and data consumers. The
main capabilities of this component can be summarized as follows:
i) Register elements of the context.
ii) Manage these elements: query and update them.
iii) Subscribe to these elements, which allows us to receive a notification
if, for example, the data content changes.
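As an illustration of the register and subscribe capabilities, the sketch below builds the JSON bodies that an NGSI v2 client would send to the Orion Context Broker (to `/v2/entities` and `/v2/subscriptions` respectively). The document does not state which NGSI version QROWD uses, so NGSI v2 is an assumption here, and the entity id, type, attribute name and notification URL are hypothetical values. Nothing is sent over the network.

```python
import json
from typing import Any, Dict, List, Tuple


def build_entity(entity_id: str, entity_type: str,
                 attrs: Dict[str, Tuple[str, Any]]) -> Dict[str, Any]:
    """Build an NGSI v2 entity: an id, a type, and typed attributes."""
    entity: Dict[str, Any] = {"id": entity_id, "type": entity_type}
    for name, (attr_type, value) in attrs.items():
        entity[name] = {"type": attr_type, "value": value}
    return entity


def build_subscription(entity_type: str, watched_attrs: List[str],
                       notify_url: str) -> Dict[str, Any]:
    """Build an NGSI v2 subscription that fires when watched attributes change."""
    return {
        "description": f"Notify on changes of {entity_type}",
        "subject": {
            "entities": [{"idPattern": ".*", "type": entity_type}],
            "condition": {"attrs": watched_attrs},
        },
        "notification": {"http": {"url": notify_url}, "attrs": watched_attrs},
    }


# Hypothetical parking lot entity and a subscription on its free spots.
entity = build_entity(
    "parking-001", "OffStreetParking",
    {"availableSpotNumber": ("Number", 12)},
)
subscription = build_subscription(
    "OffStreetParking", ["availableSpotNumber"],
    "http://dashboard.example.org/notify",
)
print(json.dumps(entity, indent=2))
```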
Context Broker is well suited to working with many different kinds of data. It is also
adapted to geolocation data and IoT (Internet of Things) data, which are the kinds
of data used to define what we call a “Smart City”.
2.4.2.2. CKAN
Several definitions of CKAN can be found on the official website 6. One of them states
that “CKAN is a fully-featured, mature, open source data portal and data
management solution”, which can explain why CKAN is broadly used in Europe as a
data catalog that allows data publishers to share collections of data with general
users.
6 https://ckan.org/about/
7 https://ckan.org/portfolio/publish-and-manage-data/
8 https://docs.ckan.org/en/latest/api/index.html
9 https://ckan.org/portfolio/datastore/
10 https://ckan.org/portfolio/visualization/
11 https://ckan.org/features/
In addition, the Open and Agile Smart City initiative (OASC) [OASC], which
aims to provide best practices for the construction of smart city systems, relies on
CKAN as one of its main pillars. It proposes the CKAN platform as the base standard
platform for publication of static file datasets 12.
The following section gives more details about the particular use of CKAN
in QROWD and the operating conventions defined.
As stated in D4.1, “static data, also known as data-at-rest, is data that
does not or barely change after its recording”. In the context of
QROWD, it refers to datasets related to the infrastructure of the city, e.g. bike
racks, paid parking zones, parking for people with disabilities, bike lanes, schools, libraries,
e-car charging stations, etc.
12 http://oascities.org/wp-content/uploads/2016/02/Open-and-Agile-Smart-Cities-Background-Document-3rd-Wave.pdf
Figure 4 shows the architecture to acquire, update and transform static datasets in
CKAN. The acquisition and transformation NiFi box is explained in full
in Section 3.4.
Several acquisition processes take part in the process of uploading datasets into the
QROWD CKAN: Processes that collect datasets from the Municipality of Trento,
tools collecting crowdsourcing information to complete the information about
Municipality infrastructure, interlinking and fusion processes, validation processes,
etc.
On the other hand, additional processes will interact with CKAN to add different
formats of existing datasets: FIWARE transformations to integrate datasets into a
common format, GeoJSON transformations to ease visualization
in the dashboard, etc.
Apart from uploading a new dataset or adding new formats to an existing one, the
framework also allows datasets to be updated by adding a new version of an
existing format to an existing dataset. In that case, and in order to avoid
inconsistencies between different formats of the same dataset, the update of a
format must be complemented with the update of the rest of the formats in the dataset,
so that all the formats reference the same content/version. Therefore, the user
should arrange not only a format update but also a set of transformations to ensure
that the different formats in a dataset version match.
To enable this, upload and transformation processes must be linked in some cases.
Section 3.6 details an example of the integrated dataset acquisition and subsequent
transformations to FIWARE and GeoJSON formats of some datasets from the
Municipality of Trento. This complex dataflow ensures the automatic triggering of all
the transformations needed when a new version of a dataset is uploaded. In the
example shown in Figure 18, when a new version of the impianti-sportivi dataset is
uploaded, not only is the resource containing the raw information updated, but
the FIWARE and GeoJSON formats are also automatically updated.
All the acquisition and transformations will be made through a set of NiFi-based
dataflows, which ease the integration with further components in the QROWD
general architecture.
In QROWD, the CKAN repository will make use of the following CKAN entities: datasets,
resources, and metadata. Dataset entities will be used for the abstract concept of a
dataset, resources for the physical manifestations of a dataset in several formats,
and a set of basic metadata for dataset annotation.
A dataset might have several formats for the same data, therefore a CKAN dataset
will have as many resource entries as there are formats available for this particular
dataset. The naming convention will be: {DatasetID} + {_} + {FormatName} + {.ext}.
For instance, if we have a “FIWARE” version and a “Dashboard/GeoJSON” version
of the Bike_Racks_MT dataset, the data acquisition framework will allow users to have a
dataset in CKAN with name “Bike_Racks_MT” with two associated resources:
“Bike_Racks_MT_FIWARE.json” and “Bike_Racks_MT_GeoJSON.json”.
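The naming convention can be expressed as a one-line helper. This is a minimal sketch; the function name `resource_name` is hypothetical and not part of the framework.

```python
def resource_name(dataset_id: str, format_name: str, ext: str) -> str:
    """Apply the convention {DatasetID} + {_} + {FormatName} + {.ext}."""
    # Accept the extension with or without its leading dot.
    if not ext.startswith("."):
        ext = "." + ext
    return f"{dataset_id}_{format_name}{ext}"


fiware_resource = resource_name("Bike_Racks_MT", "FIWARE", ".json")
geojson_resource = resource_name("Bike_Racks_MT", "GeoJSON", "json")
```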
Finally, for the annotations associated with a dataset, there will be a metadata file, in
the form of a new resource, attached to the dataset with name: {DatasetID} + {_} +
{Metadata} + {.ext}. Not all datasets will have an associated metadata file. In addition
to this metadata file, all datasets will be annotated with a set of basic metadata that
will be stored in the metadata fields of the dataset. These basic metadata are:
● Owner/Organization
● Visibility
● tag1: “Project: String”. By default “QROWD”
● tag2: “Type versioning”:{QROWD_lastVersion, QROWD_historical}
● tag3: “Type dataset”:{QROWD_source, QROWD_fusion, QROWD_official}
● tag4, …, tagN: “String”. Free tags.
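A small helper can assemble and check the mandatory tags before they are passed to a dataflow. This is only a sketch of the convention listed above; the function name `build_tag_list` and the choice to raise `ValueError` on an unknown value are assumptions, not behaviour of the QROWD processors.

```python
from typing import Iterable

VERSIONING_TAGS = {"QROWD_lastVersion", "QROWD_historical"}
DATASET_TYPE_TAGS = {"QROWD_source", "QROWD_fusion", "QROWD_official"}


def build_tag_list(versioning: str, dataset_type: str,
                   project: str = "QROWD",
                   free_tags: Iterable[str] = ()) -> str:
    """Return the comma-separated tag list: project, versioning, type, free tags."""
    if versioning not in VERSIONING_TAGS:
        raise ValueError(f"unknown versioning tag: {versioning}")
    if dataset_type not in DATASET_TYPE_TAGS:
        raise ValueError(f"unknown dataset type tag: {dataset_type}")
    return ", ".join([project, versioning, dataset_type, *free_tags])


tags = build_tag_list("QROWD_lastVersion", "QROWD_fusion")
```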
Versioning
The acquisition framework will provide a “versioning” mechanism. The user will be
able to manage different versions of a particular dataset. That is, when the content of
a dataset is modified and the user wants to store the new version without losing old
versions, the framework will automatically manage the update of the dataset and
back up the previous version. Therefore, when a new version of an existing dataset
arrives in the system:
1. A new dataset is created to back up all the existing formats (resources) of the
“fresh” dataset. The backup dataset will have the same name as the “fresh”
dataset plus a timestamp: DatasetID + “timestamp”. It can be considered a snapshot (a
copy of the state) of the “fresh” dataset at a particular point in time.
2. The “fresh” dataset (remember it always contains the last version) will be
overwritten with the new version. To do that, the resource that
matches the format of the new version will be replaced by the new version
and the rest of the outdated resources will be removed from the “fresh” dataset.
3. If any format should be recreated based on the new version, a notification
should be sent to the other components in charge of creating new formats to
complete the “fresh” and last version of the dataset. The notification could be
done through the Context Broker or any other mechanism managed by NiFi
processors.
Figure 6 shows an abstract idea of how the versioning takes place: One dataset
containing the last version (always the same) and N datasets for the historical
versions.
A CKAN repository example of the idea depicted above can be seen in Figure 6.
Four CKAN datasets, for the same “zone-traffico-limitato” dataset, can be seen:
● “zone-traffico-limitato” containing the last version and,
● Three additional ones for storing old versions:
○ “zone-traffico-limitato20181005_124813”, a version frozen on
October 5, 2018.
○ “zone-traffico-limitato20181004_171050”, a version frozen on
October 4, 2018.
○ “zone-traffico-limitato20181004_164049”, a version frozen on
October 4, 2018, at a different time.
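The versioning steps can be sketched against an in-memory stand-in for the repository. This is an illustration only, not the code of the QROWD processors: the `Repository` type and function names are hypothetical, and the timestamp pattern `%Y%m%d_%H%M%S` is inferred from the dataset names listed above.

```python
from datetime import datetime
from typing import Dict, List

# Hypothetical in-memory repository: dataset name -> {resource name: content}
Repository = Dict[str, Dict[str, bytes]]


def backup_name(dataset_id: str, when: datetime) -> str:
    """DatasetID + timestamp, e.g. zone-traffico-limitato20181005_124813."""
    return dataset_id + when.strftime("%Y%m%d_%H%M%S")


def upload_new_version(repo: Repository, dataset_id: str, resource: str,
                       content: bytes, when: datetime) -> List[str]:
    """Back up the "fresh" dataset, overwrite it, and report removed formats."""
    previous = repo.get(dataset_id, {})
    if previous:
        # Step 1: snapshot all existing resources under the backup name.
        repo[backup_name(dataset_id, when)] = dict(previous)
    # Step 2: the fresh dataset keeps only the newly uploaded resource.
    repo[dataset_id] = {resource: content}
    # Step 3: outdated formats that other components may need to recreate.
    return sorted(name for name in previous if name != resource)


repo: Repository = {
    "zone-traffico-limitato": {
        "zone-traffico-limitato.gml": b"old",
        "zone-traffico-limitato_GeoJSON.json": b"old",
    }
}
to_recreate = upload_new_version(
    repo, "zone-traffico-limitato", "zone-traffico-limitato.gml",
    b"new", datetime(2018, 10, 5, 12, 48, 13),
)
```

The returned list plays the role of the notification in step 3: it names the formats that no longer exist in the "fresh" dataset.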
Ownership
As the name of a CKAN dataset usually reflects the information contained within it
and we use the name as an unambiguous identifier, this convention suggests
adding the creator name as part of the dataset name.
13 https://docs.ckan.org/en/2.8/maintaining/authorization.html
This section describes the NiFi core templates that implement the main
functionalities offered by the static data acquisition framework:
● CSDA-1: Upload/Update a dataset into CKAN, a generic data flow in charge
of taking a dataset (file) from a remote URL and updating a CKAN
repository. The process covers the “versioning” of a dataset:
○ uploading it as a new dataset (if the dataset does not exist) or
○ updating an existing CKAN dataset and storing the former content as a
historical dataset.
● CSDA-2: Upload/Update a dataset into CKAN (with decompression), the same
data flow as CSDA-1, but allowing the user to download a dataset in the form of
a compressed package and extract/filter the files to be uploaded.
● CSDA-3: JSON-based transformations/versions of a dataset. With this
functionality the user can download a resource from a remote URL,
perform a set of transformations on the dataset and upload the result as a
new format of an existing dataset. The transformations are expressed in terms of:
○ format: the original dataset is transformed into JSON format;
○ structure: after the JSON transformation the user can
transform the JSON structure, that is:
change the names of fields, create new JSON objects, or delete
existing ones.
Finally, note that CSDA flows are based on standard NiFi processors and on
specific new NiFi processors developed in the context of QROWD. For each
processor it is indicated whether it is a standard or a custom processor.
3.4.1. CSDA-1: Upload/Update a dataset into CKAN (from a URL)
One of the main functionalities of the acquisition framework is to provide users with
the possibility of automatically uploading/updating a dataset into CKAN, including
versioning support. It takes the dataset, in the form of a file, from a
remote URL and either uploads it as a new dataset, if the dataset does not yet exist
in CKAN, or updates the existing dataset.
This process is almost the same as the previous one, with the additional functionality of
decompressing the downloaded file and filtering the extracted files to determine
which of them will go through the uploading process. The previous process
downloads exactly one file, which then goes through the process of uploading into
CKAN. This process allows the user to download a file/package,
decompress the package and apply a filtering step to select the files the user is
interested in. The files resulting from this decompress/filtering step go through
the uploading process.
Examples of use:
● Most of the datasets downloaded from the Municipality of Trento and used in
the QROWD dashboard.
Figure 9 shows the different NiFi processors that take part in the dataflow.
This data flow adds one more processor to the NiFi processors described in the
previous section:
● “UnpackContent” (standard), a processor in charge of extracting files from a
compressed package and filtering the desired files that will be incorporated into
the workflow.
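Outside NiFi, the decompress and filter step can be pictured with the Python standard library. This is only an analogy for what the dataflow does, assuming a ZIP package and a glob-style filter; the function name `unpack_and_filter` and the file names are hypothetical.

```python
import fnmatch
import io
import zipfile
from typing import Dict


def unpack_and_filter(package: bytes, pattern: str) -> Dict[str, bytes]:
    """Extract the files of a ZIP package whose names match a glob pattern."""
    selected: Dict[str, bytes] = {}
    with zipfile.ZipFile(io.BytesIO(package)) as archive:
        for name in archive.namelist():
            if fnmatch.fnmatch(name, pattern):
                selected[name] = archive.read(name)
    return selected


# Build a small package in memory to stand in for the downloaded file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("impianti_sportivi.gml", "<gml/>")
    archive.writestr("readme.txt", "notes")

files = unpack_and_filter(buffer.getvalue(), "*.gml")
```

Only the files kept in `files` would continue to the uploading step.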
This dataflow offers the user the basic functionality of a typical JSON-based DTL
(download, transform and load) process. It allows the user to download a JSON format
dataset from a URL, to transform the original format into GeoJSON format, to
perform changes in the JSON structure, and to upload the resulting dataset into
CKAN.
This dataflow does not generate a backup of any dataset: it produces new
formats of existing versions, not new versions. The
output of the transformation (the new format) is therefore assumed to be uploaded as a new
resource (new file) into an existing dataset, in particular into the dataset from which
the file to be transformed was downloaded.
Figure 10 shows the different NiFi processors that take part in the dataflow.
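The kind of structural change involved in going from plain JSON records to GeoJSON can be sketched as follows. In the template this is done with NiFi processors and a JOLT specification; the Python below is only an illustration, and the input field names (`latitude`, `longitude`, `name`) are assumptions about the source data.

```python
from typing import Any, Dict, List


def to_geojson(records: List[Dict[str, Any]],
               lon_field: str = "longitude",
               lat_field: str = "latitude") -> Dict[str, Any]:
    """Turn flat JSON records with coordinates into a GeoJSON FeatureCollection."""
    features = []
    for record in records:
        # Everything except the coordinates becomes a feature property.
        properties = {key: value for key, value in record.items()
                      if key not in (lon_field, lat_field)}
        features.append({
            "type": "Feature",
            # GeoJSON positions are ordered longitude, latitude.
            "geometry": {
                "type": "Point",
                "coordinates": [record[lon_field], record[lat_field]],
            },
            "properties": properties,
        })
    return {"type": "FeatureCollection", "features": features}


records = [{"name": "Bike rack A", "latitude": 46.07, "longitude": 11.12}]
collection = to_geojson(records)
```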
This template offers the possibility to download a dataset (input file) from a remote
URL, perform a transformation on the dataset using external procedures and store
the resulting dataset (output file) in a CKAN repository. The transformations are
command-line executions of external scripts developed by the user.
As in the previous case, this dataflow does not generate a backup of any dataset. It
is worth mentioning that the output of the transformation (the new format) is
assumed to be uploaded as a new resource (new file) into an existing dataset, in
particular into the dataset from which the file to be transformed was downloaded.
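The contract of an external-script transformation (an input file goes in, a command runs, an output file comes out) can be sketched with `subprocess`. This is an assumption-laden illustration: it supposes the script reads the dataset on standard input and writes the result to standard output, and the inline upper-casing command stands in for a real user script.

```python
import subprocess
import sys
from typing import List


def run_external_transformation(command: List[str], input_data: bytes) -> bytes:
    """Run an external command, feeding the dataset on stdin and reading stdout."""
    result = subprocess.run(
        command,
        input=input_data,
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError if the script fails
    )
    return result.stdout


# Stand-in for a user-developed script: upper-case whatever it receives.
command = [
    sys.executable, "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]
output = run_external_transformation(command, b"impianti sportivi")
```

Using `check=True` means a failing script stops the flow instead of uploading an empty or partial result.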
14 http://jolt-demo.appspot.com/#inception
Figure 11 shows the different processors used in the data flow and the properties
that should be configured by the user for each of them:
General recommendations
● API Key-based authorization is required for operating with the CKAN API. To
get a user and API Key, please register at http://CKAN_host/user/register
● To configure a processor in NiFi, double-click on it and go to the “properties”
tab.
Figure 13 shows the NiFi GUI to add a new element to the NiFi dataflow based on
previous templates. In this case the user should select “Upload_CKAN” template and
below you can find the instructions to run it.
Figure 14 shows the NiFi GUI to add a new element to the NiFi dataflow based on
previous templates. In this case the user should select
“Upload_CKAN_with_decompresion” template and below you can find the
instructions to run it.
In this case the user should also configure the following processor (before Step 2 in the
previous template):
-
Example: “QROWD, QROWD_historical,
QROWD_source”.
5. Step 5: Go to the “CKAN_Flowfile_Uploader” processor, go to the “properties” tab
and fill in:
- Mandatory
- “CKAN Url”, URL of the CKAN repository
- “File Api_key”, API Key for operating with the CKAN API.
- “Organization id”, the creator/owner of the dataset. This property
is always needed: the first time to add the organization to a
dataset, and on subsequent runs to check that the organization passed
corresponds to the owner of the dataset in CKAN. It should be
lowercase.
- “Name_of_the_package”, the CKAN dataset that will host the
new version. According to the convention defined in Section 3.3,
only “fresh” datasets hold the last version. It
should be lowercase.
- “Package visibility”, the accessibility of the dataset: “public”, if
anyone can access the dataset without restrictions, or
“private”, if access is restricted to some users.
- “Comma-separated Tag List”, where the user can specify
different alphanumeric tags. As explained in Section 3.3,
at least three tags must be provided to upload a new dataset:
- One tag indicating the project name: “String”. By default,
and in the context of the QROWD project, the user must
always set: “QROWD”.
- One tag indicating the type of versioning. As we are
configuring the uploader processor, the user must set:
“QROWD_lastVersion”
- One tag indicating the type of dataset (from a usage point of
view). In the context of QROWD the user has to select
one of the following values: {QROWD_source,
QROWD_fusion, QROWD_official}
- Example: “QROWD, QROWD_lastVersion,
QROWD_fusion”.
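To make the role of these properties concrete, the sketch below maps them onto a request to the CKAN Action API endpoint `package_create`. It is not the implementation of “CKAN_Flowfile_Uploader”, whose internals are not described here; the function name, the host `ckan.example.org` and the API key value are hypothetical, and the request is only built, not sent.

```python
import json
import urllib.request


def build_package_create_request(ckan_url: str, api_key: str, name: str,
                                 organization_id: str, visibility: str,
                                 tags: str) -> urllib.request.Request:
    """Map the uploader properties onto a CKAN package_create request."""
    payload = {
        "name": name.lower(),                  # package name, lowercase
        "owner_org": organization_id.lower(),  # organization id, lowercase
        "private": visibility == "private",
        "tags": [{"name": tag.strip()} for tag in tags.split(",")],
    }
    return urllib.request.Request(
        url=ckan_url.rstrip("/") + "/api/3/action/package_create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": api_key, "Content-Type": "application/json"},
        method="POST",
    )


request = build_package_create_request(
    "http://ckan.example.org/", "my-api-key", "Bike_Racks_MT", "QROWD",
    "public", "QROWD, QROWD_lastVersion, QROWD_fusion",
)
body = json.loads(request.data)
```

Sending the request (for example with `urllib.request.urlopen`) would require a reachable CKAN instance and a valid API key.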
Figure 15 shows the NiFi GUI to add a new element to the NiFi dataflow based on
previous templates. In this case the user should select
“JSON_based_transformation_upload_CKAN” template and below you can find the
instructions to run it.
- Mandatory
- “id”. A JSON path expression that retrieves the id of the JSON
object. Needed for the merge.
- Other
- add custom user attributes to the flowfile attributes. The values of
these attributes will be retrieved from the JSON object
processed as a flowfile. These attributes can be used in the
“JoltTransformationJSON” processor.
4. Step 4: Go to the “UpdateAttribute” processor, go to the “properties” tab and fill in:
- Other.
- “type”, a string that indicates a shared type for all the
elements. It can be useful to indicate the FIWARE type when the
transformation is from a JSON structure to a FIWARE structure.
5. Step 5: Go to the “JoltTransformationJSON” processor, go to “Advanced” (button on the
left side) and enter the mandatory JOLT specification that will
transform an input JSON structure into another JSON structure. A reference
can be found here16. A new flowfile is created with the new structure.
6. Step 6: The “MergeContent” processor does not need to be parameterized. After
this processor all the flowfiles will be merged into a single one. Internally the
processor will make use of the attribute “id” defined in “EvaluateJsonPath”.
7. Step 7: Go to the “UpdateAttribute” processor, go to the “properties” tab and fill in:
- Mandatory:
- “filename”, the name of the flowfile. This name will be used by
“CKAN_Flowfile_Uploader” to set the name of the resource (or
file) to be added to CKAN. It is a fixed string.
8. Step 8: Go to the “CKAN_Flowfile_Uploader” processor, go to the “properties” tab
and fill in:
- Mandatory
- “CKAN Url”, URL of the CKAN repository
- “File Api_key”, API Key for operating with the CKAN API.
- “Organization id”, the creator/owner of the dataset. This property
is always needed: the first time to add the organization to a
dataset, and on subsequent runs to check that the organization passed
corresponds to the owner of the dataset in CKAN. It should be
lowercase.
- “Name_of_the_package”, the CKAN dataset that will host the
new format. According to the convention in Section 3.3, the
output of the transformation (the new format) is assumed to be
uploaded as a new resource (new file) into an existing dataset,
in particular into the dataset from which the file to be
transformed was downloaded. It should be lowercase.
- “Package visibility”, the accessibility of the dataset: “public”, if
anyone can access the dataset without restrictions, or
“private”, if access is restricted to some users.
- “Comma-separated Tag List”, where the user can specify
different alphanumeric tags. As explained in Section 3.3,
at least three tags must be provided to upload a new dataset:
16 http://jolt-demo.appspot.com/#inception
Figure 16 shows the NiFi GUI to add a new element to the NiFi dataflow based on
previous templates. In this case the user should select
“External_script_transformation_upload_CKAN” template and below you can find the
instructions to run it.
The static data acquisition framework provides a set of core NiFi templates that can
be used and connected to each other in order to create more complex workflows.
As part of the framework, and as an example of the use of the core templates, a set of
more advanced ad-hoc and parameterized workflows will be made available for the
project.
Creating these workflows assumes some basic NiFi knowledge, since
it is necessary to add some NiFi components (ports, connections, etc.) to/between
the existing templates.
Figure 18 shows an example of the workflow. In particular the workflow is for the
acquisition of “impianti-sportivi” dataset.
Each of the sub-workflows now incorporates (input and/or output) ports, a new NiFi
component, to enable the connection between them.
17 http://webapps.comune.trento.it/cartografia/gis/dbexport?db=base&sc=istruzione_sport&ly=impianti_sportivi&fr=gml
The following NiFi components have been deployed on the QROWD server in
Leipzig hosted by InfAI (partner number 7):
● Basic Templates
○ CSDA-1: Upload/Update a dataset into CKAN (from a URL)
○ CSDA-2: Upload/Update a dataset into CKAN (from a URL +
decompress)
○ CSDA-3: JSON-based transformations + Update a dataset into CKAN
○ CSDA-4: External-script transformations + Update a dataset into CKAN
● N advanced workflows. Please refer to the table in ANNEX 1: Deployed dataflows
for the list of deployed dataflows.
● Bundles
Release
The CSDA templates and the specific new NiFi processors developed are released
under the Apache 2.0 licence in the following QROWD git repositories:
- CSDA templates18, a collection of templates created for the QROWD project to
be used in NiFi.
- NiFi CKAN processors19, a custom Apache NiFi processor to upload files to
CKAN.
D4.1 defined a set of requirements that the ACSD framework should accomplish.
This section aims at assessing the degree of fulfillment of those requirements, as
depicted in Table 1.
18 https://github.com/QROWD/NiFi-templates
19 https://github.com/QROWD/nifiCkanProcessor
[...] storage | [...] provide an intermediate repository or dataset portal to store (can store the resource internally, or store it simply as a link) acquired datasets | intermediate repository (CKAN-like) that feed the RDF-transformation component(s) | [...] repository
DA-104 | Metadata about data | Information or “metadata” about the data | i.e: Information about the data: title, publisher, date, schema file, provenance... | A | Each dataset will have basic tags for owner (publisher), visibility, and a associated metadata file for provenance metadata
DA-105 | Access mechanisms | The ACSD should provide means to allow subsequent systems to browse and find the data | i.e: RDF-transformation component should be able to access the data | PA | The Backend consumer system will allow retrieve datasets based on its identifier
DA-106 | Batch job execution | Ideally, the ACSD should pull data from the source systems at scheduled intervals using batch components. | i.e. Process executed as a batch job | A | The entry point NiFi processor (InvokeHTTP) of each acquisition template allow scheduling
DA-107 | Definition of datasets acquisition process | A set of first steps toward starting up the acquisition process should be defined | i.e: For each new dataset define the process to import it including: access way, transformations, metadata, timeliness, etc... | A | For each acquisition template it is provided a guided with a detailed set of steps to be followed
DA-108 | Versioning | The ACSD should be able to track different versions of datasets | i.e: Update datasets with new releases | A | A versioning mechanism is provided and implemented in the acquisition framework
DA-009 | Security restrictions | Datasets can be public or private | i.e: Some of the used datasets are the private usage | A | Dataset in CKAN can be public or private
DA-110 | ETL process | For some datasets the ACSD could implement classical ETL process | i.e: A simple extraction, transformation and loading process could be applied to specific datasets | A | A “JSON-based transformation and upload CKAN” template is provided to allow basic extract, transformation and upload process
DA-111 | File format access type | The ACSD should be able to access/transform data from several file format | i.e: XML, JSON, GML | A | The “External-script transformation and upload CKAN” allow transformations between different formats
Non-functional requirements
Based on this, we can state that the majority of the requirements of the ACSD have
been accomplished.
In D4.1, we defined real-time or streaming data for the purposes of QROWD as “a
type of dynamic data with a very high or continuously rate of change and usually
with the need to be consumed immediately after its production”. Therefore, in the
QROWD pilots, a streaming dataset can be seen as a collection of data that
provides information about the current status of the city, such as the real-time
availability of bike-sharing or the real-time status of underground parking.
Figure 19 shows the main elements needed to acquire, update and transform
dynamic datasets (e.g. underground parking status, status of bike racks for
bike-sharing) in the FIWARE Orion Context Broker using NiFi.
The entire acquisition process is carried out by a set of NiFi-based processors,
which ease integration with the other components of the QROWD general
architecture (Figure 1).
The main point users should bear in mind when operating the component provided
is that all entities posted to or updated in the Context Broker should fit the
FIWARE data models 21 where possible. For instance, FIWARE provides data models
for parking, such as off-street parking 22, that are suitable for representing
specific datasets managed by QROWD.
This NiFi process provides the means to take a given dataset in JSON format,
transform it into the FIWARE model, and update it in the Context Broker, as
shown in Figure 20.
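The transformation step can be illustrated outside NiFi with a short Python sketch. It maps a source record to an entity that follows the FIWARE OffStreetParking data model and wraps the result in a batch-update body. This is a sketch under stated assumptions rather than the deployed dataflow: the source keys (`id`, `name`, `free`, `capacity`, `lon`, `lat`), the entity id scheme and both function names are hypothetical, and the payload shape assumes the NGSI v2 batch operation (`POST /v2/op/update`), which may differ from the API version used in the deployment.

```python
import json
from typing import Any, Dict, Iterable


def to_offstreet_parking(record: Dict[str, Any]) -> Dict[str, Any]:
    """Map one source record to an NGSI v2 entity of type OffStreetParking.

    The keys read from `record` are hypothetical; a real source dataset
    will use its own field names.
    """
    return {
        "id": f"OffStreetParking:{record['id']}",  # illustrative id scheme
        "type": "OffStreetParking",
        "name": {"type": "Text", "value": record["name"]},
        "availableSpotNumber": {"type": "Number",
                                "value": int(record["free"])},
        "totalSpotNumber": {"type": "Number",
                            "value": int(record["capacity"])},
        "location": {
            "type": "geo:json",
            # GeoJSON order is [longitude, latitude]
            "value": {"type": "Point",
                      "coordinates": [record["lon"], record["lat"]]},
        },
    }


def build_upsert_payload(records: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Build a batch body; "append" creates entities or updates existing ones."""
    return {
        "actionType": "append",
        "entities": [to_offstreet_parking(r) for r in records],
    }


if __name__ == "__main__":
    # Made-up sample record, for illustration only.
    sample = {"id": "p1", "name": "Example garage", "free": "12",
              "capacity": 100, "lon": 11.12, "lat": 46.07}
    print(json.dumps(build_upsert_payload([sample]), indent=2))
```

Keeping the mapping in one pure function mirrors the role of the transformation processor in the dataflow: the source format can change without touching the code that talks to the Context Broker.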
20 https://forge.fiware.org/plugins/mediawiki/wiki/fiware/index.php/FI-WARE_NGSI-10_Open_RESTful_API_Specification
21 https://www.fiware.org/developers/data-models/
22 https://fiware-datamodels.readthedocs.io/en/latest/Parking/OffStreetParking/doc/spec/index.html
The logic of the dataflow shown in Figure 21 is quite similar to the initial steps
presented in the static template “JSON-based transformations + Update a dataset
into CKAN” in Sections 3.4 and 3.5. Refer back to those sections for further details.
Deployment
Release
The CSDA templates developed are released under the Apache 2.0 licence in the
following QROWD Git repository:
- CSDA templates 24, a collection of templates created for the QROWD project to
be used in NiFi.
23 http://jolt-demo.appspot.com/#inception
24 https://github.com/QROWD/NiFi-templates
Functional requirements
[...] | [...] | [...] schemes. Such as flat files, database dumps or SQL interfaces. The list of access types is driven by the characteristics of datasets. | [...] XF, JSON, SQL dumps, SQL interfaces, etc. | [...] | [...] requirements of the project to acquire datasets from databases
DA-205 | Variety of data sources | The system should be able quickly and easily to integrate and expose data from a variety of data sources using APIs. | | A |
Non-functional requirement
25 https://nifi.apache.org/
5. CONCLUSIONS
The document presented the main building blocks and functionalities offered by the
QROWD Data Acquisition Framework. The objective of the document was to present
the work done in the scope of WP4 for data acquisition, including software and
methodological support to enable data ingestion and transformation to fulfil the
requirements of the pilots of the project.
The users of the framework are mainly QROWD developers, but also any developer
of data-enabled applications who would like to ingest data into CKAN or the
Orion Context Broker. The framework can therefore be used to acquire data in
the scope of QROWD, but also on its own to enable data acquisition in the
abovementioned repositories.
The main results presented in the document enable the acquisition of both static and
dynamic datasets. The document describes a set of data flows created by combining
Apache NiFi templates generated in the scope of QROWD. Developers can use and
combine the templates in the NiFi GUI to define actual dataflows, building
complex data acquisition pipelines that ingest static data into CKAN or dynamic data
into the FIWARE Orion Context Broker with little or no programming effort.
The document provides hints and describes best practices to facilitate the creation of
the data flows and perform simple data transformations in NiFi. The document shows
how this is done to generate datasets in multiple formats and flavours and upload
them to CKAN in the case of static data, or to the Context Broker in the case of
dynamic data. A specific dataset versioning mechanism in CKAN has been
implemented to enable the functionality needed in QROWD.
The document is accompanied by the deployment of the Data Acquisition
Framework software on the QROWD server located at the InfAI premises in
Leipzig. Most of the code will be made available under an open source license.
6. REFERENCES
Apache NiFi. Website, Available at https://nifi.apache.org/. Accessed October 31,
2018
H. Butler, M. Daly, A. Doyle, et al. 2016. “The GeoJSON Format”, August 2016
World Wide Web Consortium, “RDF 1.1 Concepts and Abstract Syntax”, Feb 2014
Data flow | Dataset | Template | Description | File | Original URL | CKAN URL
1030 elementari | 1030 | CSDA-2, CSDA-4, CSDA-3 | Elementary schools. Acquisition from Trento Municipality, GeoJSON transformation, Fiware. | elementari-mt | http://webapps.comune.trento.it/cartografia/gis/dbexport?db=base&sc=istruzione_sport&ly=elementari&fr=gml | http://ckan.qrowd.aksw.org/dataset/elementari-mt