Sensors: Collaboration-Centred Cities Through Urban Apps Based On Open and User-Generated Data
Article
Collaboration-Centred Cities through Urban Apps
Based on Open and User-Generated Data
Unai Aguilera 1,2,*, Diego López-de-Ipiña 1,2 and Jorge Pérez 3
1 DeustoTech - Fundación Deusto, Avda. Universidades, 24, 48007 Bilbao, Spain; dipina@deusto.es
2 Facultad Ingeniería, Universidad de Deusto, Avda. Universidades, 24, 48007 Bilbao, Spain
3 TECNALIA, eServices, Velázquez 64-66, E-28001 Madrid, Spain; jorge.perez@tecnalia.com
* Correspondence: unai.aguilera@deusto.es; Tel.: +34-944-139-003 (ext. 3034); Fax: +34-944-139-166
Abstract: This paper describes the IES Cities platform conceived to streamline the development of
urban apps that combine heterogeneous datasets provided by diverse entities, namely, government,
citizens, sensor infrastructure and other information data sources. This work pursues the challenge
of achieving effective citizen collaboration by empowering them to prosume urban data across time.
In particular, this paper focuses on the query mapper, a key component of the IES Cities platform
devised to democratize the development of open data-based mobile urban apps. This component
allows developers not only to use available data, but also to contribute to existing datasets with the
execution of SQL sentences. In addition, the component allows developers to create ad hoc storages for
their applications, publishable as new datasets accessible by other consumers. As multiple users could
be contributing and using a dataset, our solution also provides a data level permission mechanism
to control how the platform manages the access to its datasets. We have evaluated the advantages
brought forward by IES Cities from the developers' perspective by describing an exemplary urban
app created on top of it. In addition, we include an evaluation of the main functionalities of the
query mapper.
1. Introduction
Increasing urbanisation is leading city authorities, governments, companies and even citizens
to think about alternative ways to manage the resources within a city. This goal is pursued not
only by seeking more efficiency, i.e., doing more with less, but also by seeking ways to achieve
a higher level of satisfaction among citizens and economic agents within a city. Indeed, the success of a
city in retaining and attracting people and companies relies highly on its appeal, economic dynamism and
awareness of the real needs of the diverse societal sectors that populate it. Only by meeting these three
premises can we truly start talking about genuinely smart cities, i.e., cities that meet citizens' needs.
Since the first public administrations started sharing their data as open data, the idea of open
government has spread around the world. Open data implies [1]: (a) more efficient and effective
government; (b) innovation and economic growth; (c) transparency and accountability and (d) inclusion
and empowerment. These arguments favour the open-by-default paradigm, but the openness of
government data is only one of the ingredients needed to address open government; involving local
entrepreneurs and citizens is also necessary. Indeed, most cities, territories and countries that have
started adopting open government policies fall seriously short in exploiting the potential of open data.
They have focused their attention only on implementing their open data portals, putting little effort into
bringing open data closer to entrepreneurs and citizens through suitable APIs, easily consumable by
application developers.
The IES Cities platform is a CIP European project aiming to promote user-centric mobile
micro-services that exploit both open data and user-supplied data. This platform contributes an
urban apps-enabling technological solution and focuses on enabling citizens to create, improve, extend
and enrich the open data associated with a city, on which urban apps are based. The main stakeholders
of the resulting urban apps ecosystem are the citizens, SMEs and public administration within a
city. The IES Cities platform is grounded on the progress achieved lately in two key technological
areas: (a) open government and urban sensor generated datasets and (b) smartphones equipped with
different sensors, e.g., GPS, which can execute urban apps, i.e., offering services for citizens in different
domains (e.g., transport, security and so on). Furthermore, the IES Cities platform aims to demonstrate
through a range of two-phase pilots in four European cities (Bristol, Majadahonda, Rovereto and
Zaragoza) how the resulting apps can actually satisfy the needs and demands of their citizens by
fostering citizen-council collaboration.
In addition to presenting the global architecture of the IES Cities platform, this paper introduces
the query mapper, a reusable component of the architecture that allows data consumers and producers
to access the city's datasets in a homogeneous manner, independently of the initial type of the data
source. This component provides a novel mechanism for accessing and contributing to the city's
dataset ecosystem. The aim of the query mapper is to allow developers to query and update datasets,
which are available in different formats, using SQL sentences. This component provides developers
with a mechanism for accessing static data sources (i.e., JSON or CSV), enabling them to access the
data required for their applications. In addition, this component not only allows querying data,
but also manages the update of data by creating and maintaining an internal storage that contains
user contributions.
This paper has the following structure: Section 2 reviews previous work related to open data
provision and management. Section 3 introduces the main characteristics of the IES Cities platform.
Section 4 describes a core component of the platform, namely the query mapper, whose purpose is to
streamline the consumption and production of urban data, focusing on how the component manages
user-provided data and its permission system. Section 5 validates the platform by describing an urban
app created for Zaragoza, while Section 6 includes an experimental evaluation of the query mapper's
main functionalities. Finally, Section 7 draws some conclusions and outlines future work.
2. Related Work
In recent years, some initiatives have emerged to foster the participation of citizens in their
cities through the consumption and production of urban data. This is the case of FixMyStreet [2] or
FixMyTransport [3], which allow users to report problems in their surroundings and help improve them.
In this new paradigm, the citizens themselves gather the data, acting as sensors with their mobile
devices and capturing different variables about the city. Usually, citizens, with the help of
sensors installed in their devices, gather and provide data related to the atmosphere (pressure, air
humidity, temperature, etc.) or other data from the environment (gas emissions, pollution, noise, etc.).
These data, together with the data introduced into applications, can be very useful to the city's public
administrations and urban planners to perform improvements and interventions in the city's infrastructure.
Data generated by citizens and gathered through crowd-sourcing, a term introduced by Howe in
his seminal paper [4], has been used for collecting encyclopaedia data (for example, Wikipedia [5]),
cartography data (e.g., OpenStreetMap [6]) and environmental data, as in the previous examples.
The resultant datasets are useful for education, urban planning or application construction.
Our approach assembles smarter cities around the following future internet technologies: Internet
of Things (IoT) [7], linked and open data [8] and crowd-sourced data [9]. IoT technologies integrate
into smart cities, providing knowledge about their situation and their citizens. These data sources offer
their knowledge, making it accessible to consumers and extending it with information coming from other
sources or directly provided by citizens.
In this work, we propose the usage of a widely known query language, i.e., SQL, to give
consumers access to data sources initially available in different formats: CSV, JSON, RDF endpoints
and relational databases. Thanks to the query mapper, we aim to provide a homogeneous data
Sensors 2016, 16, 1022 3 of 25
access mechanism for application developers who already have some experience developing mobile
applications and accessing data with a widely used language such as SQL.
The idea of using a query language to access JSON data sources is not completely new and there
exist some previous examples in the literature. For example, NoSQL stores such as CouchDB [10]
and MongoDB [11] allow accessing JSON data, but they use a document-oriented approach instead of
a relational one, which requires developers to understand a different access mechanism for the data.
Our approach, in contrast, proposes the usage of SQL, a common language for data access.
In [12] the authors propose a solution where the JSON objects are stored into rows without
disaggregating them into multiple columns. However, this requires extending the SQL syntax with
new constructs that map the data into virtual columns. Therefore, users need to learn
these new constructs to query and update the JSON data. In contrast, in [13] the authors propose
a method to store JSON documents in a relational system by disaggregating an object into multiple rows
and inserting them in a fixed-structure table. As this solution stores the data aggregated inside the
cells of the table, it also requires specific implementations of the SQL language to manage that specific
format. Our solution, however, proposes the creation of a relational representation of the data that
disaggregates the objects by creating a row per JSON object and storing its properties in specific
columns. In our approach, the resulting relational representation is directly queryable using standard
SQL syntax without the need to transform the queries to the internal representation, as occurs in the
previous approaches.
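The row-per-object mapping described above can be illustrated with a minimal sketch. It assumes a flat JSON array and uses Python's built-in sqlite3 module as a stand-in relational store; the sample data, the table name and the helper load_json_as_table are hypothetical and are not taken from the IES Cities implementation.

```python
import json
import sqlite3

# Hypothetical JSON source: a list of flat objects, as an open data portal might publish.
RAW = '''[
  {"id": 1, "name": "Plaza Mayor", "noise_db": 62.5},
  {"id": 2, "name": "Riverside", "noise_db": 48.0},
  {"id": 3, "name": "Old Town", "noise_db": 71.2}
]'''


def load_json_as_table(conn, table, objects):
    """Create one column per JSON property and insert one row per JSON object."""
    columns = sorted({key for obj in objects for key in obj})
    col_defs = ", ".join(f'"{c}"' for c in columns)
    conn.execute(f'CREATE TABLE "{table}" ({col_defs})')
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f'INSERT INTO "{table}" VALUES ({placeholders})',
        # Properties missing from an object become NULL in its row.
        [tuple(obj.get(c) for c in columns) for obj in objects],
    )


conn = sqlite3.connect(":memory:")
load_json_as_table(conn, "measurements", json.loads(RAW))

# Standard SQL runs against the mapped view; no JSON-specific syntax is needed.
noisy = conn.execute(
    "SELECT name FROM measurements WHERE noise_db > 60 ORDER BY name"
).fetchall()
print(noisy)
```

Nested objects and arrays are deliberately left out of this sketch; how they are mapped is a design decision that the sketch does not attempt to reproduce.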
Apache Drill [14] is a solution that also provides access to heterogeneous data formats, including
CSV and JSON, by allowing the execution of SQL sentences. However, as this proposal uses HDFS [15]
for the storage of the JSON files, it requires consumers to understand the specific structure and nature
of the data (e.g., internal JSON structure) to query it. Furthermore, Apache Drill does not support the
execution of SQL updates, as users must insert data through the HDFS update mechanism. Therefore,
unlike our proposal, it requires developers to use and understand two different mechanisms for data
access, instead of using a common SQL mechanism in both cases.
Our idea of simplifying the access to data sources bears some resemblance to existing
solutions that enable users to perform queries using natural language and transform them to SQL
(e.g., Quepy [16]) or SPARQL (e.g., NL-SPARQL [17]). These automatic translation tools can be
extremely useful for final users who do not have technical knowledge and need to explore or even
retrieve data from existing sources. Nevertheless, the usage of a formal query language (i.e., SQL),
although it requires developers to have some specific skills and knowledge, increases the accuracy
when selecting data and reduces the ambiguity and processing complexity of the open grammar
introduced by natural language. In addition, these natural language translation tools
do not directly provide a common access mechanism for other data source types, such as JSON and
CSV, which our approach does allow querying. However, these solutions could extend our approach
by providing a first layer that transforms natural language queries to SQL, easing the access for
non-technical users to those datasets connected through the query mapper.
On the other hand, there are several initiatives trying to promote open APIs for smart cities.
For example, CitySDK [18] is a free and open-source project providing ways for citizens to release
data. This SDK also provides libraries that third parties can use to reuse data released by citizens.
Open311 [19] offers a platform for citizens to communicate issues about public spaces and services.
It also offers a public API for third parties to develop new applications using the data provided by
citizens. Besides, Open511 [20] is an open format and API to publish road event data. Our solution,
instead of providing an API to access a single dataset, provides a common interface
to query multiple datasets with different structures and data formats.
The publication and discovery of data sources is traditionally solved with the help of solutions
such as CKAN [21], Socrata [22] or Apache Stanbol [23], the first being the most widespread solution
for this problem.
CKAN is an open-source data management system, which can be used to store, manage and
publish datasets in multiple formats. It provides a RESTful API for harvesting data, searching datasets
and retrieving data. However, the query language used by CKAN's DataStore lacks the expressivity of
the language used in our solution (thanks to the usage of SQL) and the data storage only supports
tabular data. In contrast, the solution provided by IES Cities can connect to any data source in
JSON, CSV or SPARQL format, allowing querying and updating of data with a sufficiently expressive
language such as SQL.
On the other hand, Socrata is a proprietary solution that provides a similar approach to CKAN for
accessing data; it also solves the data storage problem, providing a more integrated solution,
but shares some of the same limitations that CKAN has when accessing heterogeneous data.
IES Cities, thanks to the query mapper, not only allows serving new datasets but also solves the
connection with legacy data sources that already exist within the city's ecosystem, while providing a
common access and update mechanism for all the sources.
Finally, Apache Stanbol is a solution designed to connect semantic technologies within existing
content management systems. However, its FactStore component lacks the tools to connect with
existing data sources and focuses more on the provision of new data sources. Our solution, on the other
hand, provides a simpler query mechanism for SPARQL [24] data sources, which helps
bridge the gap between the majority of developers and Linked Data infrastructures.
There also exist other approaches to ease developers' access to open data. For example,
TheDataTank (TDT) [25] is a tool to transform online data of different formats into a user-friendly
RESTful API. MK:Smart [26] enables the collection, integration and usage of data from diverse sources
of city systems, including a data catalogue that supports the description and traceability of the data for
the generation of analytics. The data integration is extended with HyperCat [27], which is a solution to
integrate different IoT systems inside MK:Smart.
As in the case of our proposal, there exist other solutions focusing on providing a common
and integrated access to heterogeneous data producers. FI-WARE [28] is an open initiative with the
goal of creating an ecosystem of ready-to-use technologies, integrating sensor data and other sources.
Our platform focuses on solving data access and integration issues, particularly for developers when
creating their urban applications. Therefore, the IES Cities platform could also be part of a broader
solution, as it solves problems not currently addressed by other solutions when easing the access and
contribution by app developers to smart cities' data sources.
On the other hand, Ushahidi [29] represents a set of tools to democratize information, increase
transparency and lower barriers for individuals to share their stories. This platform enables easy
crowdsourcing of information through multiple channels, including SMS, email, Twitter and the web.
It includes Crowdmap [30], a tool that enables sharing a story on a map with the addition of photos
and videos. The IES Cities architecture also integrates data from different social sources accessible
through the query mapper by providing a common access mechanism similar to other data sources.
Finally, this work seeks citizen participation and contribution to a city's datasets or knowledge
base by enabling them to contribute with data through their smartphones' urban apps. Consequently,
some important concerns around user-generated data are its provenance, trustworthiness and privacy
control. To deal with provenance, the W3C has developed the PROV Data Model [31]. This work
tackles trustworthiness by delegating data validation to the consuming app's business logic, but it
also introduces a data access control mechanism integrated into the query and update functionality
provided by the query mapper. Section 4.4 describes the flexible and fine-grained mechanism to control
access to published data in IES Cities.
cities and the existing open government-related datasets. It envisages smartphones as both sensor-full
devices and a browser, i.e., hybrid app player, which nowadays almost every citizen carries with him.
IES Cities' main contribution is to devise a technological platform to foster the development of
open data apps, which are consumable from citizens' smartphones. This platform is under test in four
different European cities, providing their citizens the opportunity to get the most out of their cities'
data. Our assumption is that urban apps will be assembled from structured and non-structured data
in the form of RDF, CSV, XML or even scraped HTML pages. However, information in such models
should be mapped into JSON, a de facto lingua franca for web and mobile developers, therefore
truly promoting the consumption of urban datasets. For that reason, the platform provides
functionality to answer all user queries in JSON format, independently of the type of the underlying
dataset.
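As an illustration of answering queries in JSON regardless of the source format, the following minimal sketch reads the same records from a CSV text and from a JSON text and serialises both into one response shape. The response envelope (count and rows), the sample data and all function names are assumptions made for this example only; they do not describe the actual IES Cities response format.

```python
import csv
import io
import json


def rows_from_csv(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def rows_from_json(text):
    """Parse a JSON array of flat objects into a list of dicts."""
    return json.loads(text)


def to_json_response(rows):
    """Serialise result rows into a hypothetical JSON response envelope."""
    return json.dumps({"count": len(rows), "rows": rows}, sort_keys=True)


# The same hypothetical records published in two different formats.
csv_source = "stop,line\nCentral,12\nHarbour,7\n"
json_source = '[{"stop": "Central", "line": "12"}, {"stop": "Harbour", "line": "7"}]'

csv_response = to_json_response(rows_from_csv(csv_source))
json_response = to_json_response(rows_from_json(json_source))

# A consuming app sees the same JSON whichever format the provider used.
print(csv_response == json_response)
```

Note that CSV carries no type information, so every value read from it is a string; the JSON sample uses string values as well to keep the two responses comparable.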
IES Cities accelerates the development and deployment of new urban services that exploit city
knowledge in the form of several heterogeneous datasets such as open government repositories,
user-generated data through smartphone apps or even social data. The platform manages the whole
lifecycle of city-related data and provides functionalities to exploit that knowledge in a uniform, web
developer-friendly manner, i.e., through the SQL query language and the JSON format to exchange
data between urban apps and the platform's back-end.
Furthermore, the platform annotates user-generated data with provenance information and
includes fine-grained access control metadata for datasets, enforcing quality assurance, reliability
and security. The goal is to ensure that user-generated data truly enrich and complement existing
council-provided data. Besides, the platform provides support to host app-specific datasets so that
the whole presentation and business logic can rely on the client-side while delegating all the data
manipulation responsibility to the platform.
Figure 1. Diagram of the architecture of the IES Cities platform showing relations among
different components.
The Client Layer is composed of the IES Cities Player and the Web Interface. The IES Cities Player
is a mobile application that serves as the main entry point to browse through, review and select those
urban services run in a user's smartphone. On the other hand, the Web Interface (http://iescities.com/)
allows different actions depending on the type of user (citizens, city councils and service developers),
e.g., to browse, search, review and access applications, datasets and usage stats being generated
by the platform. Panel (a) of Figure 2 shows a capture of this web interface.
Figure 2. (a) Capture of IES Cities main web page; (b) RESTful API described using Swagger.
The Business Layer is responsible for the management of the main entities handled by the
solution, namely councils, apps, datasets, users and the gathering of usage statistics. IES Cities aims
to host the urban apps and datasets ecosystem of a city. It promotes the usage and consumption of
such apps and datasets by citizens, enterprises and the council itself. The platform exposes all the
functionality through a RESTful API (http://iescities.com/IESCities/swagger/) that groups operations
in the following categories:
• Entities, which offers CRUD operations to deal with the main entities tackled by the project.
• Logging module that enables server-side components to register diverse events associated to the apps'
  life cycle (e.g., AppStart, AppProsumer and so on), player interactions (e.g., PlayerAppSearch), or
  dataset-related events (e.g., DatasetRegistered).
• Social API that provides common search methods for applications to obtain data from different
  social networks (e.g., Facebook and Twitter).
• Data management related methods, which contains methods to enable the query and insertion of
  data through SQL. Panel (b) of Figure 2 shows the Swagger-generated web interface that
  documents the IES Cities RESTful API.
Finally, the Data Layer allows accessing different heterogeneous datasets, e.g., ideally those
modelled as 5-star linked data (e.g., RDF linked sources that are queried using SPARQL) but also other
open data available in less rich formats such as CSV or JSON files, in a homogeneous manner. In
order to make this possible, when a data provider registers a new dataset, it has to describe the
contents by including a description of its data format and other related access details. In addition, the
Data Layer also supports direct connection with different external database management systems,
which contain legacy data that providers want to offer as open data.
3.2. IES Cities Operation
The common modus operandi of the platform is as follows:
1. The dataset provider (e.g., municipality) registers within the IES Cities server its datasets'
   descriptions, by means of a web form or using an automatic process through the IES Cities API
   in those cases that a large number of datasets needs to be registered. For each dataset the
   provider indicates where the dataset can be located and accessed (URI), what is the original
   format of the data (CSV, JSON, RDF, etc.) and, in the case of RDF datasets, a description that
   correlates and maps the data from its semantic (RDF) representation to its relational view.
2. Developers discover which datasets are available by browsing or searching in the IES Cities
   dataset repository and decide which ones best fit their application. Through a RESTful
   JSON-based API, they obtain the description of the datasets' structure, using a relational view of
   the connected data source (i.e., tables and columns).
3. After obtaining the description, the developer can start issuing queries, retrieving results in
   JSON format, and updates using SQL sentences. Datasets can have different permissions to
   access the query and update operations, see Section 4.4.
4. Once the application has been finished, developers can register it within the platform through a
   web form, providing among other details where the application is accessible for download (URI
   in the application store) and related description, including snapshots and geographical
   information. The IES Cities Player offers this information when showing applications to the
   users based on their position or their searches.
5. Finally, end-users, i.e., citizens, with the help of the IES Cities player app, browse or search for
   available registered urban apps according to their location and interests.
Figure 3 depicts a graphical representation of the IES Cities operation including all the steps of
the data publication and consumption process.
Figure 3. Main operation of the IES Cities platform showing: (1) registration of datasets; (2) search
of datasets; (3) data query using SQL; (4) application registration; (5) app discovery through IES
Cities Player.
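The registration and discovery steps (1 and 2) of the operation described above can be sketched with a small in-memory catalogue. This is only an illustration under stated assumptions: the classes DatasetDescription and DatasetCatalog, their fields, the example URI and the set of accepted formats are hypothetical and do not correspond to the real IES Cities API, which is exposed over REST.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Formats named in the text; the exact identifiers are an assumption of this sketch.
SUPPORTED_FORMATS = {"csv", "json", "rdf", "database"}


@dataclass
class DatasetDescription:
    """What a provider supplies at registration time (hypothetical field names)."""
    name: str
    uri: str                      # where the dataset can be located and accessed
    data_format: str              # original format of the data
    keywords: List[str] = field(default_factory=list)
    rdf_mapping: Optional[Dict[str, str]] = None  # RDF-to-relational mapping, RDF only


class DatasetCatalog:
    """In-memory stand-in for the dataset repository."""

    def __init__(self):
        self._datasets = {}

    def register(self, description):
        fmt = description.data_format.lower()
        if fmt not in SUPPORTED_FORMATS:
            raise ValueError(f"unsupported format: {description.data_format}")
        # RDF datasets must also describe how they map to a relational view.
        if fmt == "rdf" and not description.rdf_mapping:
            raise ValueError("RDF datasets need a mapping to their relational view")
        self._datasets[description.name] = description

    def search(self, term):
        """Return the names of datasets whose name or keywords contain the term."""
        term = term.lower()
        return sorted(
            name
            for name, d in self._datasets.items()
            if term in name.lower() or any(term in k.lower() for k in d.keywords)
        )


catalog = DatasetCatalog()
catalog.register(
    DatasetDescription(
        name="zaragoza_noise",
        uri="http://example.org/noise.csv",  # placeholder URI
        data_format="CSV",
        keywords=["environment", "noise"],
    )
)
print(catalog.search("noise"))
```

Keeping the RDF mapping as part of the registration data mirrors step 1, where the provider supplies the correspondence between the semantic representation and the relational view.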
4. The Query Mapper
The Query Mapper component's underlying philosophy is to create a relational view for all
mapped data sources, allowing the execution of queries to retrieve and update the data. The main
functionality of this module, exposed by the IES Cities API, is the following:
• Querying using SQL. The query mapper provides methods to query a dataset using the SQL
  query language. The registered dataset must be compatible with these types of queries, which
  requires being transformable by the query mapper to a relational view (JSON, CSV or SPARQL
  datasets).
• Querying using SPARQL. The query mapper also provides support for querying datasets using
  SPARQL. In this case, only datasets mapped to SPARQL data sources are compatible with this
  language. More experienced users can use this functionality to access data sources modelled as
  RDF graphs.
• Update using SQL. This module also contains functionality to insert or remove data through SQL
  from the registered datasets, thanks to the query mapper, which fully supports CRUD
  operations over datasets.
• Obtain datasets structure. The query mapper transforms connected data sources (JSON, CSV) to
  a relational storage that can be accessed (queried or updated) using SQL sentences. Therefore,
  developers require discovering the relational view (i.e., tables and columns) of the registered
  datasets, in order to start using them in their applications.
In addition to the previous core functionality, the query mapper also provides other utility methods in its REST API. These methods perform different functions with the data: (a) obtain the results in JSON-LD format, which is only supported for datasets containing semantic information (i.e., connected SPARQL endpoints); or (b) perform updates in JSON format, instead of using SQL. This way, the query mapper provides a more coherent data API: as the API allows retrieving data in JSON format, it also provides methods to push data in JSON format, and the query mapper will transform it to the corresponding SQL update or insert sentences. Figure 4 shows a screen capture of the invocation of a data access method using Swagger.
Figure 4. RESTful API invocation through the Swagger interface.
4.1. Connecting Data Sources
Datasets registered in the IES Cities platform require a mapping description to connect with the data sources. Mappings, described using JSON, specify different attributes, depending on the type of the connected dataset. The query mapper uses these attributes to create the relational view of the connected source. However, all mappings contain, at least, the type of the connected source. Currently, the supported data sources are the following: SPARQL, JSON, CSV and relational databases.
Let us consider the JSON file fragment shown in Figure 5 as an example of the mapping process. The figure shows an excerpt extracted from a publicly available data source in JSON format, provided by the Zaragoza city council, one of the partners of the IES Cities project. The full document, which contains details about accommodation options in the city of Zaragoza, is available for download at the following URL: https://www.zaragoza.es/api/recurso/turismo/alojamiento.json.
On the other hand, Figure 6 shows the JSON description required by the query mapper to register the previous dataset within the IES Cities platform. The description contains different attributes that provide the information required by the query mapper in order to create the relational representation of the data source.
The mapping field instructs the query mapper that the input data source is of JSON type. Supported types are csv, json, sparql and database.
The uri field points to the source URL where the file containing the data is located (local files are also supported).
The root field defines the JSON key where the mapping process starts. JSON data sources can contain multiple nested objects, so this field allows specifying a route within the JSON data file to the place where the data is contained.
The key field selects the attribute from the set of mapped objects used as the primary key for the main relational table generated from this JSON.
The refresh field defines the interval, in seconds, to retrieve the new data and update the internal structures, since JSON files can change periodically.
The optional table field indicates the name of the main relational table generated by the mapping process. If not specified, the mapper names the table as the value of the key specified in the root field.
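As a rough illustration of the fields listed above, a mapping description could be assembled and checked as in the following sketch. The contents of Figure 6 are not reproduced here: the flat layout, the values given to root, key and refresh, the set of fields treated as mandatory, and the helpers check_mapping and table_name are all assumptions made for this example, not part of the IES Cities API.

```python
# Hypothetical mapping description. Field names follow the text above;
# the values and the flat layout are assumptions made for illustration.
mapping = {
    "mapping": "json",
    "uri": "https://www.zaragoza.es/api/recurso/turismo/alojamiento.json",
    "root": "result",   # assumed key that holds the list of objects
    "key": "id",        # assumed attribute used as primary key
    "refresh": 86400,   # assumed interval, in seconds, between downloads
    "table": "hotel",   # optional name of the main table
}

SUPPORTED_TYPES = {"csv", "json", "sparql", "database"}


def check_mapping(desc):
    """Return a list with the problems found in a mapping description."""
    problems = []
    if desc.get("mapping") not in SUPPORTED_TYPES:
        problems.append("unsupported or missing 'mapping' type")
    # Treating these three fields as mandatory is an assumption.
    for field in ("uri", "root", "key"):
        if not desc.get(field):
            problems.append(f"missing '{field}'")
    refresh = desc.get("refresh")
    if not isinstance(refresh, int) or refresh <= 0:
        problems.append("'refresh' must be a positive number of seconds")
    return problems


def table_name(desc):
    """The table field is optional; fall back to the value of the root field."""
    return desc.get("table") or desc["root"]


print(check_mapping(mapping), table_name(mapping))
```

The fallback in table_name mirrors the rule stated for the optional table field; everything else is only one possible reading of the description.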
In the previous example, in order to produce a relational database in first normal form (1NF), the mapper will produce two additional tables from the given JSON contents, namely, hotel_geometry and hotel_geometry_coordinates. The query mapper applies this process iteratively if the JSON data file contains multiple nested objects, meaning that the query mapper will create as many intermediate tables as needed to represent the data in a normalized relational view.
Figure 6. JSON mapping to register accommodation JSON dataset in IES Cities.
The query mapper extracts the JSON dataset and creates tables representing the same data but with a relational structure. The query mapper applies an extraction procedure to create the relational view of the connected data source. Figure 7 shows the resulting table structure, obtained after applying the previous procedure to the data contained in the dataset shown in Figure 5.
Figure 7. Data serialized in JSON format obtained from the execution of the SQL select statement.
The procedure used to extract the structure from a JSON object and create a relational structure works in the following way:
The list of objects, specified by the root property of the mapping, is iterated and the query mapper
converts all literal attributes (strings and numbers) to their relational equivalent representation.
The query mapper applies a finer data type detection to check if those string values contained
in the JSON are representing dates or, otherwise, if numeric values are representing integer or
float types.
Each object maps to a table, whose schema contains all literal properties as columns with the detected type. In the example, this is the case of the table hotel, whose columns store the data of all the related objects. As our approach uses the keys of the original JSON data as the column names of the resulting tables, we escape the column names, because they can contain spaces or other special characters not directly allowed as column names in SQLite. As JSON keys cannot directly include a double quote character without escaping it, the column name does not require further processing.
For those properties that do not have a literal type but contain an array of other objects, the process creates a new child table by applying a recursive process. This child table connects to its parent using a special key column added to the child table. This is the case, for example, of the tables hotel_geometry and hotel_geometry_coordinates, which contain intermediate data.
If the JSON property contains an array of literal values, the process also creates a new table
containing a single column, using the same foreign key mechanism explained before to connect
the newly created table with the one containing the parent object. In the example, this is the case
of the column coordinates of table hotel_geometry_coordinates.
The extraction process results in a first normal form (1NF) database because all related data is contained in a separate table (e.g., hotel_geometry or hotel_geometry_coordinates), eliminating repeating groups in tables (e.g., this is the case of the column coordinates in the hotel_geometry_coordinates table) and using primary keys that uniquely identify each row. The mapping description allows the administrator who is connecting the data source to select which column or columns constitute the primary key of the table. If the data included in the table does not have a specific primary key, the query mapper can create an ad hoc numeric identifier for each row.
There is no way to ensure that the data contained in a table is in second normal form (2NF), and therefore neither in third normal form (3NF), because the query mapper cannot extract the semantics of the table and infer which columns depend on each other in order to define and populate new tables. Only in those cases where the original dataset already has a partially normalized structure could the mapping process result in a 2NF relational database. This means that the query mapper cannot improve the structure of a dataset; it can only provide easier access to the data it contains through a common query mechanism (i.e., with the execution of SQL queries and updates).
The translation of the JSON structure to a 1NF relational model disaggregates the information contained in a JSON object into multiple columns. In our approach, instead of storing aggregated data [12–14], we create specific tables that store each object in a different row and its properties in different columns, using the name of the keys and their inferred type to create them. Therefore, the translation results in multiple CREATE TABLE statements, one per required table, defining the name and type of its columns. Finally, the extractor populates the tables with the information contained in the JSON object. Once the process creates the tables and extracts the data, users can execute SQL queries and updates directly on the resulting tables.
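The extraction procedure described above can be sketched with Python's standard sqlite3 module. This is a minimal sketch under stated assumptions, not the query mapper's actual implementation: the helper names (quote, sql_type, ensure_table, store), the row_id column, the sample object and the simplified type detection (dates are not detected) are hypothetical. Only the table names, the coordinates column and the parent_id column come from the text.

```python
import sqlite3


def quote(name):
    """Escape an identifier for SQLite by wrapping it in double quotes."""
    return '"' + name.replace('"', '""') + '"'


def sql_type(value):
    """Simplified type detection; date detection is left out of this sketch."""
    if isinstance(value, int):  # also covers bool
        return "INTEGER"
    if isinstance(value, float):
        return "REAL"
    return "TEXT"


def ensure_table(conn, table, columns):
    """Create the table if needed and add any column it does not have yet."""
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {quote(table)} (row_id INTEGER PRIMARY KEY)"
    )
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({quote(table)})")}
    for name, col_type in columns.items():
        if name not in existing:
            conn.execute(
                f"ALTER TABLE {quote(table)} ADD COLUMN {quote(name)} {col_type}"
            )


def store(conn, table, obj, parent_id=None):
    """Insert one JSON object as a row and recurse into its nested values."""
    literals = {k: v for k, v in obj.items() if not isinstance(v, (dict, list))}
    columns = {"parent_id": "INTEGER"}
    columns.update({k: sql_type(v) for k, v in literals.items()})
    ensure_table(conn, table, columns)

    names = ["parent_id"] + list(literals)
    values = [parent_id] + list(literals.values())
    column_list = ", ".join(quote(n) for n in names)
    placeholders = ", ".join("?" for _ in names)
    cursor = conn.execute(
        f"INSERT INTO {quote(table)} ({column_list}) VALUES ({placeholders})", values
    )
    row_id = cursor.lastrowid

    for key, value in obj.items():
        child = f"{table}_{key}"
        if isinstance(value, dict):
            store(conn, child, value, row_id)
        elif isinstance(value, list):
            for item in value:
                # An array of literals becomes a child table with one data column.
                store(conn, child, item if isinstance(item, dict) else {key: item}, row_id)
    return row_id


# Invented sample object, loosely shaped like the accommodation example.
data = {"result": [{"id": 1, "title": "Hotel A",
                    "geometry": {"type": "Point", "coordinates": [-0.88, 41.65]}}]}
conn = sqlite3.connect(":memory:")
for obj in data["result"]:
    store(conn, "hotel", obj)
```

With this input the sketch creates the tables hotel, hotel_geometry and hotel_geometry_coordinates, each child row pointing to its parent through parent_id. JSON keys named row_id or parent_id would clash with the bookkeeping columns, a case the sketch does not handle.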
On the other hand, the permissions field manages the access control, configurable for each operation, i.e., select, delete, insert and update. The query mapper applies this control on every table provided by the relational view of a dataset. In the case of the example, for the main table hotel, everybody can select its contents; however, for the table hotel_geometry the query mapper will only allow those users identified as user1 and user2 to perform a select operation. For those operations not declared in the permissions section, the access defaults to NONE, which implies that nobody can delete, update or insert contents into these tables using the RESTful API. This makes sense for this example dataset, since the council regulates it. However, for app-specific datasets, some users could have access to contribute their information and update the dataset. Section 4.4 provides more information about the permission management feature provided by the query mapper.
Finally, Figure 8 shows the result of querying the hotels JSON dataset mapped earlier through a SQL query and the results of such query returned in JSON format.
Figure 8. Data serialized in JSON format obtained from the execution of the SQL select statement.
The query mapper transforms the data into different tables in order to obtain a normalized view. Therefore, consumers may need to join them using the specific identifiers generated by the query mapper during the extraction process. For example, in this case, the intermediate tables reference their parent tables through a column called parent_id, which was not present in the original data and is required to join the different tables of the relational view. This way, users can obtain the information contained in the JSON structure by connecting the created tables through the intermediate columns created to store the disaggregated data. For example, in Figure 8, the query retrieves part of the information disaggregated into multiple tables. As the example shows, the query can retrieve the data with a structure different from the original JSON (e.g., the geometry JSON object is missing in Figure 8), depending on how the query selects and connects the tables and their columns. Although our approach requires users to connect the tables manually and allows retrieving the data with a structure different from the original one, it provides developers with greater flexibility when selecting the specific data consumed by the applications.
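A join of this kind could look like the following sketch, written against an in-memory SQLite database. The table names and the parent_id column come from the text, while the row_id column referenced by parent_id, the remaining column names and the sample rows are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical relational view of one accommodation entry.
conn.executescript("""
CREATE TABLE hotel (row_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE hotel_geometry (
    row_id INTEGER PRIMARY KEY, parent_id INTEGER, type TEXT);
CREATE TABLE hotel_geometry_coordinates (
    row_id INTEGER PRIMARY KEY, parent_id INTEGER, coordinates REAL);
INSERT INTO hotel VALUES (1, 'Hotel A');
INSERT INTO hotel_geometry VALUES (1, 1, 'Point');
INSERT INTO hotel_geometry_coordinates VALUES (1, 1, -0.88), (2, 1, 41.65);
""")

# Each child table is connected to its parent through parent_id.
query = """
SELECT h.title, c.coordinates
FROM hotel AS h
JOIN hotel_geometry AS g ON g.parent_id = h.row_id
JOIN hotel_geometry_coordinates AS c ON c.parent_id = g.row_id
ORDER BY c.row_id
"""
rows = conn.execute(query).fetchall()
print(rows)
```

The result is flat: each coordinate is returned next to the hotel title, and the intermediate geometry object no longer appears, which matches the observation that the retrieved structure depends on how the query connects the tables.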
4.2. Prosuming Data
The IES Cities platform allows users to query datasets and to contribute with data, thanks to the functionality provided by the query mapper. In both cases, the users can access heterogeneous datasets (JSON, CSV, SPARQL and relational DBs) using the SQL query language. Every time the platform connects with a new dataset, the query mapper extracts its information and creates a relational view of the data, which is, henceforth, queriable using SQL. The platform extracts the data periodically, using the configured interval, in order to maintain an up-to-date version of the extracted information for those external datasets whose source has changed.
As the data extracted by the query mapper, when connecting with JSON and CSV datasets, is
backed by a SQLite database, it can solve any SQL query supported by this database management
system: COUNT, DISTINCT, WHERE, JOIN, LIMIT, etc.
In our proposal, we have covered two different scenarios when users want to contribute with
their own data: (a) contribution to an existing dataset and (b) creation of a new dataset from scratch.
We think that these two scenarios cover the different use cases that are required for an open data
platform oriented to the creation of urban applications. While the former case is useful to map and
access already existing data sources, the latter allows users to easily create datasets that are specific for
their applications but, in addition, that also could be useful for other consumers, if configured with
query permissions for other users.
JSON and CSV datasets. The query mapper transforms external datasets provided in these formats, as explained in Section 4.1, to a relational representation by creating a local database, one for each dataset, and downloading the data. Once the mapper creates the relational view of the dataset and downloads the data, the dataset can be queried using SQL sentences answered by the relational representation of the data. By using this approach, we are assuming that JSON and CSV datasets are not directly updateable, as data servers usually provide them as downloadable files.
SPARQL datasets. This type of external dataset is updateable if the endpoint accepts SPARUL sentences. In those cases, the query mapper transforms the SQL updates to a corresponding SPARUL statement that is sent to and executed by the endpoint. The IES Cities platform does not maintain any data and it sends all the queries and updates, once translated to SPARQL or SPARUL, respectively, to the corresponding endpoint.
Relational datasets. In this case, the connected dataset is already available in a relational representation. Therefore, there is no need to translate the SQL queries and updates, which are executed directly against the connected database. The IES Cities platform only acts as a gateway to the data, which can be useful for those legacy datasets that some publishers want to transform into open datasets. Database administrators can create relational views and/or apply the access mechanism introduced by the query mapper to control how the users access and update the data through the IES Cities platform.
In the case of JSON and CSV datasets, which are not directly updateable, user-contributed data is stored in its own internal database, one for each connected dataset. Therefore, the platform has two internal sources of information that can solve each query: one containing the original information and the other one with the data contributed by the users. The user data storage has the same data structure (i.e., tables and columns) as the one containing the original data, but is focused on storing users' contributions.
When a user sends a query to a dataset that is split into two storages (original data and user contributions), the query mapper executes the query in both internal storages and merges the data to build a unique response for the user, as depicted in Figure 9. Furthermore, the user can select if the query executes in the original data storage, in the user contributed data storage or simultaneously in both storages. In the latter case, the query mapper merges responses obtained from the two storages into a single one and returns it to the consumer.
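The following sketch illustrates the idea of answering one query from two storages that share the same structure. The text does not detail how responses are merged, so the plain concatenation used here, as well as the names make_storage and run_query and the origin values original, user and any, are assumptions made for this example.

```python
import sqlite3


def make_storage(rows):
    """Create an in-memory storage holding a hypothetical hotel table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE hotel (title TEXT, user_id TEXT)")
    conn.executemany("INSERT INTO hotel VALUES (?, ?)", rows)
    return conn


# Both storages have the same tables and columns.
original = make_storage([("Hotel A", None)])
user_data = make_storage([("Hostel B", "user1")])


def run_query(sql, params=(), origin="any"):
    """Run sql on the selected storage(s) and concatenate the returned rows."""
    targets = {
        "original": [original],
        "user": [user_data],
        "any": [original, user_data],
    }[origin]
    rows = []
    for conn in targets:
        rows.extend(conn.execute(sql, params).fetchall())
    return rows


print(run_query("SELECT title FROM hotel"))
print(run_query("SELECT title FROM hotel", origin="user"))
```

Concatenation is enough for simple selections, but aggregates such as COUNT or clauses such as LIMIT and ORDER BY would need to be recombined after merging, which this sketch does not attempt.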
Figure 9. Data queries performed on the two internal storages: original data and user data.
Figure 10. Example of user created datasets description.
4.3. Data Quality
The IES Cities platform provides two quality assurance mechanisms for datasets: quality level and verified by:
Quality level is a numeric property that assesses the dataset conformance with one of the levels of the 5-star rating for open data (http://5stardata.info/). For external datasets, this indicator can be automatically computed from the information contained in the mapping that connects to the external dataset (JSON, CSV, SPARQL and relational DB), as these types correspond with different levels of the 5-star classification.
Verified by tells consumers whether the dataset and its data have been verified to fulfil the quality standards defined by its publisher or administrator. This field is useful for data consumers to decide if the dataset has the quality required for their purposes. In the case of public administrations, some manager, on behalf of the public entity, will usually be responsible for verifying published datasets.
The IES Cities platform manages a dataset's original data and the data inserted by users separately, as explained in Section 4.2. Because user-provided data cannot always guarantee the same quality standards as the data directly provided by original sources, data separation allows data consumers to query original data and user data separately.
In addition, user contributed data could be integrated by the dataset administrator into the original dataset, after applying the proposed validation process, provided that the user generated data fulfils the quality standards defined by the publisher of the dataset. Regarding validation of datasets published within IES Cities, we have defined the following process, also depicted in Figure 11:
1. Dataset publishers provide a validation schema for their datasets. We use JSON Schema (http://json-schema.org/), as all the datasets are accessible in JSON format using the query mapper and, therefore, we can apply this mechanism to validate them. JSON Schema allows declaring the properties, data types and restrictions a JSON document must fulfil in order to be valid.
2. In addition, publishers define a query that the validator will use to retrieve the data stored in the dataset in JSON format. This query could retrieve all the dataset's data or only a part of it for its validation, adding more flexibility to the validation process.
3. An automatic process executes the provided query on the corresponding dataset and validates the results using the associated JSON Schema, determining whether or not the data conforms to the requirements specified by the publisher.
4. If the data passes the validation process, the validator updates the dataset's Verified by property accordingly. On the other hand, if the data does not comply with the specified validation schema, the administrator will need to apply some correction procedures, e.g., fix or discard erroneous data before accepting it into the main dataset storage.
The validator executes this process multiple times in order to assure the quality of the data after any change or insertion into the dataset's storage. In addition, if configured, the insertion mechanism rolls back any change that does not conform to the required data schema. Using this mechanism, dataset administrators can be sure that all the inserted data fulfils a minimum data standard related to its structure and value restrictions.
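The validate-then-rollback behaviour can be sketched as follows. To keep the example self-contained, a small hand-written check that understands only the required and type keywords stands in for a real JSON Schema validator; the schema, the hotel table, the validation query and the function names are hypothetical and are not taken from the platform.

```python
import sqlite3

# Hand-written stand-in for a JSON Schema validator: it only understands
# "required" and "type", a tiny subset of the real specification.
SCHEMA = {
    "required": ["title", "stars"],
    "properties": {"title": {"type": "string"}, "stars": {"type": "integer"}},
}
PY_TYPES = {"string": str, "integer": int, "number": (int, float)}


def is_valid(doc, schema):
    """Check required properties and declared types of one document."""
    if any(name not in doc for name in schema.get("required", [])):
        return False
    for name, rule in schema.get("properties", {}).items():
        if name in doc and not isinstance(doc[name], PY_TYPES[rule["type"]]):
            return False
    return True


conn = sqlite3.connect(":memory:")
# "stars" has no declared type so SQLite stores the value exactly as given.
conn.execute("CREATE TABLE hotel (title TEXT, stars)")
conn.commit()


def insert_checked(row):
    """Insert row, re-run the validation query, and roll back on failure."""
    conn.execute(
        "INSERT INTO hotel (title, stars) VALUES (?, ?)",
        (row.get("title"), row.get("stars")),
    )
    # Validation query: here it simply retrieves the whole table.
    docs = [
        dict(zip(("title", "stars"), r))
        for r in conn.execute("SELECT title, stars FROM hotel")
    ]
    if all(is_valid(doc, SCHEMA) for doc in docs):
        conn.commit()
        return True
    conn.rollback()
    return False


print(insert_checked({"title": "Hotel A", "stars": 3}))
print(insert_checked({"title": "Hotel B", "stars": "three"}))
```

The second insertion is undone because the stored value of stars is not an integer, so the table keeps only the first row. A production setup would delegate the check to a complete JSON Schema implementation.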
4.4. Data Access Control
The IES Cities platform provides an access control mechanism that allows publishers to manage how users contribute to the datasets. As the users execute SQL sentences on datasets, redirected by the query mapper onto the corresponding relational view of the dataset, there are four different actions configurable on a dataset, which correspond to the usual SQL statements: SELECT, INSERT, UPDATE and DELETE. Each action supports one of the four access permissions currently supported by the platform:
ALL: any user can execute actions configured with this permission.
NONE: this permission expresses that users cannot execute the associated action. It is useful,
for example, to create read only datasets, which have all actions set to NONE except for the SELECT
action that can be set to ALL.
USER: only users contained in the specified access list can execute the associated action.
This permission can be useful for situations where only a group of users can select or update
some dataset.
OWNER: only the owner of the inserted data can access or modify it, depending on the action
configured with this permission.
Whenever a user contributes new data to a dataset, the query mapper also stores the
identifier of the user that inserted that information. As explained in Section 4.1, the IES Cities platform
creates relational representations of the connected datasets (JSON and CSV formats) by downloading the
associated data to an internal relational storage. During this process, the query mapper adds a new
column to each table it creates to store the id of the user who contributes new data. This is
the information used later when applying the OWNER permission to an action. For example, if the
permission for UPDATE and DELETE only allows access to owners, users can only modify or
remove those rows that they previously inserted themselves and that, therefore, have the corresponding
user_id column set to their own identifier.
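The four permissions and the role of the stored user identifier can be summarized in a short Python sketch. This is only an illustration under stated assumptions: the `Permission` enumeration, the `can_execute` function and the `dataset_permissions` dictionary are hypothetical names, and the sketch assumes that rows without a recorded owner (i.e., original dataset rows) never satisfy the OWNER permission.

```python
from enum import Enum

class Permission(Enum):
    ALL = "ALL"
    NONE = "NONE"
    USER = "USER"
    OWNER = "OWNER"

def can_execute(permission, user_id, access_list=(), row_owner_id=None):
    """Return True if user_id may run an action guarded by `permission`.

    access_list is only consulted for USER; row_owner_id (the value of the
    user_id column of the affected row) is only consulted for OWNER.
    """
    if permission is Permission.ALL:
        return True
    if permission is Permission.NONE:
        return False
    if permission is Permission.USER:
        return user_id in access_list
    if permission is Permission.OWNER:
        # Assumption: rows with no recorded owner are never editable by owners.
        return row_owner_id is not None and user_id == row_owner_id
    raise ValueError(f"unknown permission: {permission!r}")

# Hypothetical dataset configuration: readable by anyone, writable by listed
# users, and editable only by the user who inserted each row.
dataset_permissions = {
    "SELECT": Permission.ALL,
    "INSERT": Permission.USER,
    "UPDATE": Permission.OWNER,
    "DELETE": Permission.OWNER,
}

select_ok = can_execute(dataset_permissions["SELECT"], "alice")
insert_ok = can_execute(dataset_permissions["INSERT"], "bob", access_list={"alice"})
update_ok = can_execute(dataset_permissions["UPDATE"], "alice", row_owner_id="alice")
delete_ok = can_execute(dataset_permissions["DELETE"], "bob", row_owner_id="alice")
```

With this configuration, any user can read the dataset, only users in the access list can insert, and each contributor can update or delete only the rows whose user_id column matches their own identifier.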
1. Definition of a range of Key Performance Indicators (KPIs) regarding the types of users and
the different uses of the apps. Some common KPIs defined across apps are number of downloads,
number of active users, etc.
2. Set-up of a range of data sources to feed the KPIs, which include the following:
(a) User questionnaires to ask users directly about their opinion and experience with
the application.
(b) Logging data from logs of events generated by the apps in use.
(c) Google Play, i.e., the marketplace where our apps are available, to obtain usage statistics.
(d) In-app questionnaires periodically launched within the apps to gather usage feedback.
3. Mapping of data sources to KPIs. From the available data sources, we
have collected and assigned values for the KPI variables.
The preliminary evaluation results obtained by applying this methodology demonstrate
that the developed apps have a high average degree of acceptance (e.g., over 80%).
The IES Cities platform demonstrates its capabilities to accelerate urban app development by
the fact that developers have only to submit SQL queries through a REST API to the IES Cities Query
Mapper. This component talks to the Zaragoza SPARQL endpoint and maps the results into JSON,
without requiring, for the developer, the skills and knowledge to understand the syntax of the
SPARQL language.
During initialization, and after the user login, the app queries for available complaints and
suggestions, and displays the result on a map, as shown in panel (a) of Figure 12. The requested
information is minimal in order to reduce network usage. Users submit new complaints and
suggestions by simply filling a form to which the app automatically attaches the location information,
shown in panel (b) of Figure 12. On the other hand, users can review the most recent complaints by
selecting the corresponding menu option, shown in panel (c) of Figure 12. By clicking on the marker
of a report, the app shows a description of the report and provides access to its full details.
From the data owner's point of view, i.e., Zaragoza's council, the enrichment of its datasets by
third parties presented some issues, e.g., the fact that the app cannot publish data not previously approved,
or the fact that there was no mechanism to control the quality and quantity of citizen-added data.
To address this, IntelliSense techniques and other consolidation techniques are combined with
the one introduced in Section 4.3 in order to validate the user-provided information.
6. Evaluation
We have evaluated the performance of the query mapper module of the IES Cities platform.
Particularly, we have focused on two of its tasks: the extraction of registered datasets (JSON and CSV
datasets) to create a relational view, and the data access process that retrieves the stored information
by executing or transforming the SQL sentences to the required format (i.e., SPARQL endpoints).
The query mapper and the IES Cities platform are implemented using Oracle Java 7 and Jersey
(https://jersey.java.net) to construct the RESTful API. During the experimentation, we have used Jetty
as the servlet container to serve the platform. The test platform runs Ubuntu 15.10 on an Intel
Quad Core 1.90 GHz with 8 GB of RAM. We have performed all experiments locally to minimize the
latency introduced by the network during the dataset download and response transfer. This way, the
experiments can focus on those aspects that really depend on the characteristics of the proposed
mapping process and not on external factors, such as the network load. In addition, the query mapper
internal scheduler, used to parallelize the dataset transformation to their relational views, uses 50
threads to run the data extraction jobs.
6.1. Dataset Extraction Process
The first experiment has measured how the dataset extraction process behaves under different
circumstances by changing the size of the datasets and their type. As shown in Figure 13, we have
repeated the experiments using different dataset sizes ranging from 224 KB to 57 MB,
approximately doubling the size of the connected dataset in each test.
Figure 13. The graph shows the time required by the query mapper to extract a dataset depending on
its size and its format (CSV or JSON).
Figure 13 shows that the extraction time, i.e., the time needed to convert the connected datasets
to their relational views, is approximately 10 seconds for datasets under 7.1 MB, for both JSON
and CSV datasets. However, the experiments also show that this time increases considerably for
bigger datasets, especially for those available in CSV format.
The explanation for this difference in scalability between the two formats is the way that the
query mapper manages CSV datasets compared to JSON ones. In the first case, the query mapper
first transforms the CSV dataset to a JSON representation, by converting the table to a list of JSON
objects with the same properties contained in the CSV file. Then the query mapper transforms the
resulting JSON representation to the final relational database. In the current implementation, the
mapper writes the intermediate JSON file to the disk and then the JSON transforming process loads it,
creating the final representation. Therefore, the data extraction to the database is not a
direct process. For this reason, future implementations of the CSV extractor process will reduce the
latency by eliminating this intermediate step during the transformation.
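The two-step CSV extraction described above (CSV to a list of JSON objects, then JSON to a relational table) can be sketched in Python as follows. This is a simplified illustration, not the query mapper's code: the function names, the `dataset` table and the sample data are hypothetical, SQLite stands in for the relational storage, every column is stored as text, and the intermediate JSON is kept in memory instead of being written to disk.

```python
import csv
import io
import json
import sqlite3

def csv_to_json_objects(csv_text):
    """Step 1: turn each CSV row into a JSON-style object keyed by the header."""
    return [dict(row) for row in csv.DictReader(io.StringIO(csv_text))]

def load_objects(conn, table, objects):
    """Step 2: create a table from the object keys and insert every object."""
    columns = list(objects[0].keys())
    col_defs = ", ".join(f'"{c}" TEXT' for c in columns)
    conn.execute(f'CREATE TABLE "{table}" ({col_defs})')
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f'INSERT INTO "{table}" VALUES ({placeholders})',
        [tuple(obj[c] for c in columns) for obj in objects],
    )

csv_text = "id,street,status\n1,Gran Via,open\n2,Coso,closed\n"
objects = csv_to_json_objects(csv_text)

# Intermediate JSON representation; the text says the current implementation
# writes it to disk and reads it back, which this sketch only imitates in memory.
intermediate = json.dumps(objects)

conn = sqlite3.connect(":memory:")
load_objects(conn, "dataset", json.loads(intermediate))
rows = conn.execute(
    'SELECT street FROM "dataset" WHERE status = ?', ("open",)
).fetchall()
```

Removing the intermediate step would amount to feeding the rows read from the CSV directly into the insertion stage, which is the improvement the authors announce for future versions.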
On the other hand, Table 1 contains the total time required by the platform to extract, in parallel,
multiple registered datasets. During its normal operation, the platform will extract or update the
data of various datasets, using the internal scheduler of the query mapper to launch the extraction
jobs. Obviously, these times will depend on the number of processors of the machine and the number
of configured threads. The numbers gathered in the table show that, in the worst case scenario
(500 datasets pending to be processed), the platform will finish extracting the data of all the registered
datasets in approximately 20 minutes with smaller size datasets (~250 KB) and 2.3 hours in the case of
bigger datasets (~3.7 MB).
Table 1. Total extraction times, in seconds, when the platform contains multiple registered datasets.
Although these times can seem a bit high, we should remark that they correspond to a situation
where the platform needs to process 500 datasets to extract their data. As explained in Section 4.1,
the dataset mapping description contains a refresh parameter that controls how frequently a dataset
requires updating. Therefore, despite a platform having a high number of registered datasets, they
will not continuously require an update, meaning that the average load of the platform will be lower
than the one shown in Table 1, once the system has performed the initial start-up.
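The effect of the refresh parameter on the scheduler load can be illustrated with the following Python sketch. It rests on several assumptions that do not come from the paper: the refresh interval is expressed in seconds, each dataset is a plain dictionary with hypothetical keys, and the `extract` function is a placeholder for the real extraction job. Only the pool size of 50 threads is taken from the evaluation set-up.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def needs_refresh(dataset, now):
    """A dataset is due if it was never extracted or its refresh interval elapsed."""
    last = dataset["last_extracted"]
    return last is None or now - last >= dataset["refresh_seconds"]

def extract(dataset):
    """Placeholder for the real extraction job; it only records the timestamp."""
    dataset["last_extracted"] = time.time()
    return dataset["name"]

def run_due_extractions(datasets, now, max_workers=50):
    """Run extraction jobs in parallel, but only for datasets that are due."""
    due = [d for d in datasets if needs_refresh(d, now)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return sorted(pool.map(extract, due))

now = time.time()
datasets = [
    {"name": "bus_stops", "refresh_seconds": 3600, "last_extracted": now - 7200},
    {"name": "air_quality", "refresh_seconds": 3600, "last_extracted": now - 60},
    {"name": "complaints", "refresh_seconds": 86400, "last_extracted": None},
]
refreshed = run_due_extractions(datasets, now)
```

In this example only two of the three datasets are extracted, which mirrors the argument that, after the initial start-up, the scheduler rarely has to process all registered datasets at once.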
Finally, Table 2 summarizes the average extraction time per dataset in the same situation shown
in Table 1. As shown by the table, particularly in the case of bigger datasets, the number of configured
threads and the number of available processors cause the highest processing times. However, the
platform could always scale to process the extraction of a higher number of datasets if there is an
increase in the number of job processors, regardless of any other future improvement or optimization in its
current implementation. In addition, the table also shows that these times are acceptable for
processing data on a platform deployed in an environment with a more limited number of datasets
(up to 500) without requiring further investments to improve the hardware.
Table 2. Extraction time per dataset, in seconds, when the platform contains multiple registered datasets.
contained in the JSON file. Table 3 summarizes the results of the experiments. We have measured the
percentage of correct relational data extractions using our proposal. As the table shows, the procedure
has not been able to extract the information in all the cases. However, after inspecting the errors,
we have detected that the failing extractions are due to problems in the original data sources, which
provide invalid JSON or CSV files (e.g., invalid formatting, encoding problems, etc.). On the contrary,
in those cases where the source data is in a correct format, the extraction process has correctly extracted
all the information to a relational database.
Table 3. Correct extraction percentage, average generated tables and query correctness per type.
Dataset Type    Correct Extraction    Average Generated Tables    Query Retrieval Correctness
JSON            93%                   2.4                         100%
CSV             95%                   1.0                         100%
For correctly extracted data sources, we have then calculated the average number of tables
generated per dataset. As expected, in the case of CSV datasets, the extractor generates a single
table to represent the data in a queryable format. However, for JSON datasets, the query mapper
generates an average of 2.4 tables per dataset. This means that, in the random sample of
datasets used for the evaluation, there is an average maximum nesting depth of JSON objects of three
levels. Therefore, even for complex queries, consumers will need no more than two table JOINs.
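To make the relation between nested JSON objects, generated tables and JOINs concrete, the following Python sketch flattens a small nested structure into two tables linked by an internal identifier. It is only an approximation of the idea: the table names, the `_id` and `parent_id` columns and the sample records are hypothetical, and the actual naming scheme used by the query mapper may differ.

```python
import sqlite3

# Two JSON objects, each with a nested list of objects (one nesting level).
records = [
    {"title": "Broken lamp", "comments": [{"text": "Still broken"}, {"text": "Fixed"}]},
    {"title": "Pothole", "comments": []},
]

conn = sqlite3.connect(":memory:")
# One table for the main objects and one for the nested objects.
conn.execute("CREATE TABLE report (_id INTEGER PRIMARY KEY, title TEXT)")
conn.execute(
    "CREATE TABLE report_comments "
    "(_id INTEGER PRIMARY KEY, parent_id INTEGER, text TEXT)"
)

for record in records:
    cur = conn.execute("INSERT INTO report (title) VALUES (?)", (record["title"],))
    parent_id = cur.lastrowid  # internal identifier linking child rows to the parent
    for comment in record["comments"]:
        conn.execute(
            "INSERT INTO report_comments (parent_id, text) VALUES (?, ?)",
            (parent_id, comment["text"]),
        )

# A single JOIN reconnects the disaggregated tables.
joined = conn.execute(
    "SELECT r.title, c.text FROM report r "
    "JOIN report_comments c ON c.parent_id = r._id "
    "ORDER BY c._id"
).fetchall()
```

Here one level of nesting produces two tables and requires one JOIN; a further nesting level would add a third table and a second JOIN, which is consistent with the average figures reported above.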
Finally, we have compared the data retrieved from our relational mapping with the original data
contained in the mapped data sources. We have executed different types of queries: simple queries that
retrieve the data from the main object table, queries that retrieve and connect data from multiple tables
and, in addition, queries that filter the data applying WHERE clauses. In all cases, we have compared the
results obtained through our query mapper with those results expected if the same data were selected
from the original data source.
The results show that, if the queries are correctly constructed (i.e., the users correctly connect the
disaggregated tables using the internal identifiers) the returned data matches the original one and the
JSON objects and CSV entries are correctly selected based on the applied filtering.
user provided in the process. Thanks to the separation between original and user-contributed data,
consumers can easily identify which data was originally provided by the connected dataset and which
was introduced by other external contributors.
In addition, the query mapper component manages the access to the data by providing a
permission system that controls which users can manipulate the data (query and/or modify
it). This access management system enables data administrators to control which users have permission
to select specific parts of the dataset or to introduce new data into the user space storage. Besides, the
query mapper tracks the owner of each inserted piece of data, meaning that, if required, only the user
that initially contributed a specific portion of the data can update or remove it from
the dataset.
On the other hand, we have provided an exemplary real application currently deployed in the
Spanish city of Zaragoza, highlighting the properties of IES Cities to enable an easier consumption
and generation of open data, and explaining how it can foster the creation of new urban applications.
In addition, the paper includes an evaluation showing that the process proposed to extract and convert
JSON and CSV files to their relational representation behaves correctly for datasets up to ~57 MB,
particularly in the case of JSON data sources. The identified scalability issues are related to
the current implementation decisions and the characteristics of the testing environment (i.e., number
of processing threads), and we expect future releases of the platform to address them.
As future work, we plan to add support for connecting other types of datasets, e.g., XML data
sources, to the IES Cities platform. However, although it could be possible to easily connect and map
simple XML datasets through the SQL query mechanism, following a similar approach to the one applied
to JSON data sources, we need to perform further work to decide how more complex XML structures
(e.g., KML [33] or CityGML [34]) can be connected and accessed through the platform. These future
extensions will allow users to query and update, through the homogeneous access mechanism, a greater
variety of data formats, which will enable the platform to cover the most common types for open data
in urban ecosystems. Finally, in future versions of the platform, we also plan to support the execution
of queries joining data from different connected data sources, aggregating data originally available in
multiple formats.
Acknowledgments: This project has received funding from the European Union's Competitiveness and
Innovation Framework Programme under grant agreement No. 325097.
Author Contributions: Unai Aguilera has implemented the Query Mapper module and related parts of the IES
Cities REST API, and performed the evaluation included in the paper. Diego López-de-Ipiña and Jorge Pérez
have participated in the application of the app evaluation methodology, provided background for the
work, and supervised and reviewed the whole IES Cities project. All the authors have contributed to the writing of
the paper.
Conflicts of Interest: The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
References
1. Davies, T. Open Data Institute: Open Data Barometer - 2013 Global Report. Available online: http://
www.opendataresearch.org/dl/odb2013/Open-Data-Barometer-2013-Global-Report.pdf (accessed on 28
June 2016).
2. FixMyStreet. Available online: http://www.fixmystreet.com/ (accessed on 2 June 2016).
3. FixMyTransport. Available online: http://www.fixmytransport.com/ (accessed on 2 June 2016).
4. Howe, J. The rise of crowdsourcing. Wired Mag. 2006, 14, 14.
5. Wikipedia, the Free Encyclopaedia. Available online: http://www.wikipedia.org (accessed on 2 June 2016).
6. OpenStreetMap. Available online: http://www.openstreetmap.org/ (accessed on 2 June 2016).
7. Atzori, L.; Iera, A.; Morabito, G. The internet of things: A survey. Comput. Netw. 2010, 54, 2787–2805.
[CrossRef]
8. Heath, T.; Bizer, C. Linked Data: Evolving the Web into a Global Data Space, Synthesis Lectures on the Semantic
Web, 1st ed.; Morgan & Claypool Publishers: San Rafael, CA, USA, 2011.
9. Kittur, A.; Pendleton, B.A.; Suh, B.; Mytkowicz, T. Power of the Few vs. Wisdom of the Crowd: Wikipedia
and the Rise of the Bourgeoisie. In Proceedings of the 25th Annual ACM Conference on Human Factors in
Computing Systems (CHI 2007), San Jose, CA, USA, 28 April–3 May 2007.
10. Apache Software Foundation. Apache CouchDB. Available online: http://couchdb.apache.org (accessed on
21 June 2016).
11. Gen, Inc. MongoDB. Available online: http://mongodb.org (accessed on 21 June 2016).
12. Liu, Z.H.; Hammerschmidt, B.; McMahon, D. JSON data management: supporting schema-less development
in RDBMS. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data,
Zhangjiajie, China, 18–20 November 2014; pp. 1247–1258.
13. Chasseur, C.; Li, Y.; Patel, J.M. Enabling JSON Document Stores in Relational Systems. WebDB 2013, 13,
14–15.
14. Hausenblas, M.; Nadeau, J. Apache drill: interactive ad-hoc analysis at scale. Big Data 2013, 1, 100–104.
[CrossRef]
15. Shvachko, K.; Kuang, H.; Radia, S.; Chansler, R. The hadoop distributed file system. In Proceedings of the
2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), Incline Village, NV, USA, 3–7 May
2010; pp. 1–10.
16. Quepy - Transform Natural Language to Database Queries. Available online: http://quepy.machinalis.com/
(accessed on 2 June 2016).
17. NL-SPARQL: A Dialog-System Challenge Set for Converting Natural Language to Structured Queries.
Available online: http://research.microsoft.com/en-US/projects/nl-sparql/default.aspx (accessed on
2 June 2016).
18. CitySDK - Open, Harmonized SDK APIs. Available online: http://www.citysdk.eu (accessed on
2 June 2016).
19. Open311 - A Collaborative Model and Open Standard for Civic Issue Tracking. Available online: http:
//open311.org/ (accessed on 2 June 2016).
20. Open511 - Open Format for Publishing Event Road Data. Available online: http://open511.org/ (accessed
on 2 June 2016).
21. CKAN - The Open Source Data Portal Software. Available online: http://ckan.org/ (accessed on
2 June 2016).
22. Socrata - The Data Platform for 21st Century Digital Government. Available online: http://www.socrata.
com (accessed on 2 June 2016).
23. Apache Stanbol. Available online: https://stanbol.apache.org (accessed on 2 June 2016).
24. Prud'Hommeaux, E.; Seaborne, A. SPARQL Query Language for RDF. Available online: https://www.w3.
org/TR/rdf-sparql-query/ (accessed on 2 June 2016).
25. The Data Tank - An Open Data Management Systems Maintained by OKFN Belgium. Available online:
http://thedatatank.com/ (accessed on 4 September 2013).
26. MK:Smart. Available online: http://www.mksmart.org (accessed on 2 June 2016).
27. HyperCat Consortium. Available online: http://www.hypercat.io (accessed on 2 June 2016).
28. FI-WARE. Available online: http://www.fiware.org (accessed on 2 June 2016).
29. Ushahidi Open Source Software for Information Collection, Visualization and Interactive Mapping. Available
online: http://ushahidi.com/ (accessed on 28 June 2016).
30. Crowdmap - Collaborative Map Editor. Available online: http://crowdmap.com (accessed on 20 April 2016).
31. W3C, PROV Model Primer. Working Group Note 30 April 2013. Available online: http://www.w3.org/TR/
2013/NOTE-prov-primer-20130430/ (accessed on 4 September 2013).
32. Amberg, M.; Hirschmeier, M.; Wehrmann, J. The Compass Acceptance Model for the analysis and evaluation
of mobile services. Int. J. Mob. Commun. 2004, 23, 248–259. [CrossRef]
33. KML. Open Geospatial Consortium (OGC). Available online: http://www.opengeospatial.org/standards/
kml (accessed on 2 June 2016).
34. CityGML. Open Geospatial Consortium (OGC). Available online: http://www.opengeospatial.org/
standards/citygml (accessed on 2 June 2016).
© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC-BY) license (http://creativecommons.org/licenses/by/4.0/).