This is the authors’ preprint version. The article is accepted for publication in the Special Issue on Data Transparency
in the ACM Journal of Data and Information Quality (JDIQ). ©2021
arXiv:2105.09312v2 [cs.DB] 21 May 2021
Knowledge-driven Data Ecosystems Towards Data Transparency
SANDRA GEISLER, Fraunhofer FIT and RWTH Aachen University, Germany
MARIA-ESTHER VIDAL, TIB-Leibniz Information Centre for Science and Technology, Germany
CINZIA CAPPIELLO, Politecnico di Milano, Italy
BERNADETTE FARIAS LÓSCIO, Federal University of Pernambuco, Brazil
AVIGDOR GAL, Technion Israel Institute of Technology, Israel
MATTHIAS JARKE, RWTH Aachen University and Fraunhofer FIT, Germany
MAURIZIO LENZERINI, Sapienza Università di Roma, Italy
PAOLO MISSIER, Newcastle University, United Kingdom
BORIS OTTO, TU Dortmund University, Germany and Fraunhofer ISST, Germany
ELDA PAJA, IT University of Copenhagen, Denmark
BARBARA PERNICI, Politecnico di Milano, Italy
JAKOB REHOF, TU Dortmund University, Germany and Fraunhofer ISST, Germany
A Data Ecosystem offers a keystone-player or alliance-driven infrastructure that enables the interaction of
different stakeholders and the resolution of interoperability issues among shared data. However, despite years
of research in data governance and management, trustability is still affected by the absence of transparent and
traceable data-driven pipelines. In this work, we focus on requirements and challenges that data ecosystems
face when ensuring data transparency. Requirements are derived from data management and organizational aspects, as well as from broader legal and ethical considerations. We propose a novel knowledge-driven data ecosystem
architecture, providing the pillars for satisfying the analyzed requirements. We illustrate the potential of our
proposal in a real-world scenario. Lastly, we discuss and rate the potential of the proposed architecture in the
fulfillment of these requirements.
Additional Key Words and Phrases: Data transparency, data ecosystems, data quality, trustability
1 INTRODUCTION
Industrial digitalization and the use of information technologies in public and private sectors
provide evidence of the pivotal role of data. However, despite the paramount relevance of data-driven technologies, organizations demand alliance-driven infrastructures capable of supporting controlled data exchange across diverse stakeholders and transparent data management.
Data Ecosystems (DEs) are distributed, open, and adaptive information systems with the characteristics of being self-organizing, scalable, and sustainable [27]. While centered on data, the main
Authors’ addresses: Sandra Geisler, sandra.geisler@fit.fraunhofer.de, Fraunhofer FIT and RWTH Aachen University, Germany, Schloss Birlinghoven, Sankt Augustin, 53757; Maria-Esther Vidal, maria.vidal@tib.eu, TIB-Leibniz Information Centre
for Science and Technology, Germany, Welfengarten 1B, Hannover, 30167; Cinzia Cappiello, cinzia.cappiello@polimi.it,
Politecnico di Milano, Italy, piazza Leonardo da Vinci 32, Milano, 20133; Bernadette Farias Lóscio, bfl@cin.ufpe.br, Federal
University of Pernambuco, Brazil, Cidade Universitaria, Recife/PE, 50740-560; Avigdor Gal, avigal@ie.technion.ac.il, Technion
Israel Institute of Technology, Israel, Technion City, Haifa, 32000; Matthias Jarke, jarke@dbis.rwth-aachen.de, RWTH Aachen
University and Fraunhofer FIT, Germany, Ahornstrasse 55, Aachen, 52056; Maurizio Lenzerini, lenzerini@diag.uniroma1.it,
Sapienza Università di Roma, Italy, via Ariosto 25, Roma, I-00185; Paolo Missier, paolo.missier@ncl.ac.uk, Newcastle
University, United Kingdom, Firebrick Avenue, Newcastle upon Tyne, NE4 5TG; Boris Otto, boris.otto@cs.tu-dortmund.de,
TU Dortmund University, Germany, Otto-Hahn-Str. 12, Dortmund, 44227, Fraunhofer ISST, Germany, Emil-Figge-Straße 91,
Dortmund, 44227; Elda Paja, elpa@itu.dk, IT University of Copenhagen, Denmark, Rued Langgaards Vej 7, Copenhagen
S, DK-2300; Barbara Pernici, barbara.pernici@polimi.it, Politecnico di Milano, Italy, piazza Leonardo da Vinci 32, Milano,
20133; Jakob Rehof, jakob.rehof@cs.tu-dortmund.de, TU Dortmund University, Germany, Otto-Hahn-Str. 12, Dortmund,
44227, Fraunhofer ISST, Germany, Emil-Figge-Straße 91, Dortmund, 44227.
concern with DEs is about knowledge generation and sharing. Thus, they aim to solve issues like learning from unstructured and heterogeneous data, and to construct new abstractions and mappings.
They may also offer various data-centric services, including query processing and data analytics.
DEs are equipped with computational methods to exchange and integrate data while preserving
personal data privacy, data security, and organizational data sovereignty. The report of the Dagstuhl
Seminar 19391 (September 22-27, 2019)1 on Data Ecosystems: Sovereign Data Exchange among
Organizations [6] contains summaries of discussions and abstracts of talks from the seminar on
various topics, including requirements, use cases, and architectures. Diverse reference architectures
rely on DE foundations [2, 26]. Keystone player-driven data ecosystems and B2C platforms like
Google, Alibaba, or Facebook are hugely successful. In contrast, the adoption of alliance-driven
platforms which aim at more equitable control and data sharing [28] is still lagging, even in crucial
domains such as data-driven B2B engineering collaboration [19] or biomedicine. This paper focuses
on the alliance-driven setting, even though many of the issues addressed also occur in the other category.
A few works address general data quality (DQ) aspects of DEs (e.g., [4, 12, 23, 35]). In [12], the
authors focus on open DEs and claim that data availability and DQ need to be guaranteed so that users are not hesitant to use the data. Kitsios et al. [23] depict DQ assessment as one of
the fundamental components for building and maintaining a DE. DQ assessment requires the
definition of a DQ model composed of DQ dimensions and metrics. Several DQ dimensions have
been defined in the literature, as discussed in [4]. Many dimensions have a possible impact on
data fairness and trustability, in particular completeness, accuracy, and consistency, which have a significant impact both on the transparent processing of data for analysis and on data pipelines. The
lack of accountability for data transparency is one of the severe limitations of existing interoperable
methods and represents a critical aspect of data quality. This paper starts from the hypothesis that
these limitations could be a significant reason for the slow adoption of DEs.
To account for data transparency, the rest of this paper offers the following contributions: (1) an
analysis of the specific requirements arising for DEs and from DEs regarding data governance and
transparency aspects; (2) a new form of networks of knowledge-driven DEs towards trustworthiness
and wider adoption; (3) challenges that stem from the identified requirements; and (4) an assessment
of how the various DE types address these requirements.
2 REQUIREMENTS OF TRANSPARENT DATA ECOSYSTEMS MOTIVATED BY AN EXAMPLE
We motivate the need for expressive data ecosystems with an example from the health domain.
Subsequently, we derive the requirements for data transparency in similar scenarios.
2.1 Motivating Example
Consider a use case of multi-site clinical studies as an example to illustrate the impact that managing multiple stakeholders has on interoperability and transparency requirements. In these studies,
several parties are involved; they include clinics, resident doctors, data scientists, patients, study
nurses, quality assurance, researchers, and care services. A stakeholder may have multiple sources
generating data. For example, clinicians conduct examinations and collect, amongst others, sensor
readings, medical images, test results, and diagnostic reports. These data collections are processed
(e.g., transformed, curated, and integrated); for transparency reasons, they are potentially annotated
with meta-data, domain vocabularies, and data quality values.
Data is analyzed to uncover insights that can support clinicians in conducting thoughtful diagnostics and effective treatments. Data management tasks are also influenced by the organization’s
1 http://www.dagstuhl.de/19391
regulations or higher instances, such as regulations for data protection or rules defined by the
hospital, and strategic decisions. Patients may require transparency about both their treatment and the privacy protection of their data in cross-clinical studies. Each of these data management tasks already raises multiple challenges for data transparency and data quality management.
Additionally, data is exchanged between stakeholders to fulfill the goals of the studies; data collected
from the various sites has to be pseudonymized and integrated to be audited by quality assurance.
But transparency of these processes must be maintained to protect against scientific fraud. Further integration with data from additional parties, such as health insurance companies, may be needed before the data is finally analyzed by study researchers.
The study setting corresponds to a network of knowledge-driven DEs. This network aligns the
stakeholder DEs and their data; it also uses meta-data to describe the network and its constituents.
Furthermore, the network is influenced by regulations, contracts, or agreements specific to the study
at hand. They may include participation agreements created by insurance companies and patient consent forms authorizing data usage for specific studies. Heterogeneity issues across the different
network DEs impose challenges for DQ management. Moreover, documenting computational
methods performed to assess and curate data quality issues is crucial to guarantee data transparency.
2.2 Requirements Analysis
The motivating example highlights the multiple issues that a DE needs to cover in order to enhance
trustability of the involved stakeholders. These issues are not only present in biomedical applications, but rather exist in any application where crucial decisions are driven by data [3]. Based
on literature and reports from current European data sharing and data space projects [3, 10, 29],
requirements can be categorized along data management, organizational aspects, and legal and
ethical issues. In terms of data management, tackling the challenges outlined in the motivating example demands (meta)data sharing among different stakeholders in a secure and traceable manner. At an organizational level, trustable data exchange requires complex ecosystems that rest on organization-specific business models, processes, and strategies to enforce sovereignty, privacy, and protection of both data and analytical outcomes. Furthermore, sharing sensitive and personal data, e.g., clinical records, must comply with data protection regulations and legal requirements at national and international levels. More importantly, accounting for ethical decisions made by stakeholders and algorithms is crucial, as is the ability to provide reliable and verifiable explanations of these decisions. Meeting these requirements at a legal and ethical level empowers DEs to safeguard data privacy and mitigate unfairness in data-driven pipelines. Moreover, the satisfaction of
these requirements provides the foundations for ensuring that clinical data is only used according to the consents given by its data owners, i.e., the patients. Next, each of these three requirement
categories is described in more detail.
Data management requirements. DEs, as described in the motivating example, demand sharing
of data with a high variety (e.g., in terms of type, structure, size, or frequency). The requirements
listed in this category concern both data and metadata. Data quality management has to be able to
(DMR1) handle all kinds of data and offer common DQ tools to describe, query, and assess quality
values for the data. In the medical domain, for example, this comprises unstructured data, such as images and texts, but also highly structured data from databases, CSV files, and data streams.
Additionally, (DMR2) the data has to be fit for sharing. Data has to exhibit quality values that fulfill a certain quality standard, making it suitable for sharing in a defined context. Data consumers, especially in data markets, thereby have the possibility to query data based on its quality. Hence, data can be rejected if it does not satisfy the negotiated standards. In the motivating example, this could mean that the reading center rejects the data because important values are missing,
i.e., the completeness of the data set is too low. Furthermore, data transparency plays a crucial
role for enhancing trust for all stakeholders. (DMR3) Data transparency has to be enabled from the
origin of the data until its usage. At any time in a data-driven pipeline, the current meaning of the
data has to be available, as well as metadata describing data transformations made by the different
components of the pipeline. This explicitly includes traceability and transparency of algorithms
and their results (e.g., for data curation and integration, or for prediction). Consider for example
the use of data from cancer registries by researchers and other registries. Both need to know
explicitly how the data has been acquired and modified to estimate its value for the research at hand.
Potential conflicts between data transparency and company secrets or privacy may exist. Thus,
transparency must be offered to all the stakeholders according to their role in the DE, and in terms
of consent management and usage control. Hence, (DMR4) data quality management needs to take
trade-offs into account and provide dimensions and assessment metrics that enable stakeholders to
rate the possible impact of, e.g., data curation. Anonymized medical data, for example, may lose its value for further research if important attributes are eliminated from a data set. Lastly, data
integration and querying over multiple data sources and across organizations are required in a
multitude of scenarios. For this, mappings among data sources are defined either manually or
(semi-)automatically by schema matching. For data quality management, this implies several aspects; fundamentally, (DMR5) stakeholders should be part of the loop of data quality assessment. They should
be able to rate the quality of every step in a data-driven pipeline, e.g., schema and entity mappings
or query answers. The automatic matching between huge medical taxonomies, e.g., for decision
support systems, may be very error-prone as the taxonomies per se have quality problems. (DMR6)
The impact of adding a new component to a DE should be measurable. The DEs and their stakeholders
should be able to rate the impact of, e.g., the information gain of adding a new data set. This is
a crucial aspect, especially when considering paying for a costly data set or when integrating the data set requires substantial upfront effort in terms of data cleaning, transformation, or data integration.
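To make the quality gate behind DMR1 and DMR2 concrete, the following minimal sketch is our own illustration (not taken from any cited system; the record fields, attribute names, and the 0.9 threshold are hypothetical). It computes a simple completeness metric and rejects a batch that falls below a negotiated standard, as in the reading-center example above.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical record structure for a shared clinical data set.
@dataclass
class Record:
    patient_id: str
    blood_pressure: Optional[float] = None
    heart_rate: Optional[float] = None

def completeness(records: list[Record], attributes: list[str]) -> float:
    """Fraction of non-missing attribute values over all records,
    a common metric for the completeness DQ dimension."""
    total = len(records) * len(attributes)
    if total == 0:
        return 1.0
    present = sum(getattr(r, a) is not None for r in records for a in attributes)
    return present / total

def accept_for_sharing(records: list[Record], threshold: float = 0.9) -> bool:
    """DMR2-style gate: a consumer (e.g., the reading center) rejects a
    data set whose completeness falls below the negotiated standard."""
    return completeness(records, ["blood_pressure", "heart_rate"]) >= threshold

# One of two records misses heart_rate: completeness is 0.75 < 0.9.
batch = [Record("p1", 120.0, 60.0), Record("p2", 118.0, None)]
print(accept_for_sharing(batch))  # False: the batch would be rejected
```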
Organizational-centric requirements. In cases of sensitive data exchange and processing, data
must be transparently used according to an organization’s policies, as well as its business models
and strategies. (OCR1) Enabling data governance is crucial for the appropriate data exchange
and sharing according to the organizations’ strategies and business models. (OCR2) Ensuring
traceability of data sovereignty is essential to increase trust among stakeholders. Again, this has to
be ensured throughout the whole data processing pipeline including data quality assessment and
curation. For example, the willingness of patients to use applications or participate in studies may
be increased by giving them the opportunity to enforce access and usage policies. Furthermore,
(OCR3) business, certification, and utility models need to be established to certify, based on data
quality values and other characteristics, the monetary value of exchanged and transformed data. The
monetary value of medical data is manifold, e.g., data from clinical studies may be of interest to other
parties, such as pharmaceutical or insurance companies, to create or improve products. (OCR4)
Adherence to data and data processing standards is required to enhance DQ and interoperability
across stakeholders, which is specifically important in the medical domain. Standards such as FHIR2
have made an important step forward in reaching these goals in the clinical domain. (OCR5) Flexible DQ management is needed for different coordination and negotiation models among stakeholders (e.g., clinics, data scientists, and insurance companies), also considering the evolution of these models over time.
Legal and ethical requirements. As shown in the motivating example, respecting personal data privacy and security during data management, exchange, and analytics imposes requirements
2 http://www.hl7.org/fhir/index.html
at both legal and ethical levels. Both categories of requirements are aligned with the European
Union guidelines for Trustworthy AI [14]. (L&ER1) Providing expressive legal frameworks for exchanged data, including legal references, responsibilities, licenses, and ethical guidelines is essential.
(L&ER2) Accounting for and mitigating bias and unfairness ensures that the outcomes of the execution
of the system components are independent of sensitive attributes (e.g., gender, age, ethnicity, or
health conditions) and augment confidence in the impartiality of the system behavior. (L&ER3)
Ensuring the safety and robustness of the decisions made by each component that exchanges, processes, or analyzes data. Thus, the deployment of data-driven pipelines and their outcomes will guarantee compliance with ethical guidelines (e.g., respecting patient consents) and reduce misuse that could lead to data quality issues, data privacy violations, and unfairness. (L&ER4) Enforcing data protection and ownership safeguards privacy, sovereignty, and
legal compliance with licenses and regulations for data sharing, exchange, and processing. Thus,
the satisfaction of this requirement will ensure that clinical data is used by the distinct parties
as indicated in the patient’s consents. (L&ER5) Pursuing diversity and non-discrimination in data
collections shared, exchanged, and processed by data-driven pipelines. As a result, the risk of
excluding specific entities (e.g., patients with a given health condition) is mitigated, and the chances
of covering all the representative entities of the population increase. (L&ER6) Trackability of regulatory compliance, such that each data-driven decision can be documented and validated
in terms of legal regulations, business models, and ethical guidelines. Lastly, (L&ER7) Trustworthiness and Reliability of data-driven pipelines demand the accurate measurement, validation, and
interpretation of each of the decisions taken by the system components in compliance with legal
and ethical guidelines of the stakeholders. Thus, data owners (e.g., patients, insurance companies,
and researchers) will be able to trace management and analysis methods performed over their data.
As illustrated in the next section, this paper positions networks of knowledge-driven DEs as alliance-driven decentralized infrastructures empowered with components to satisfy the requirements listed above. Lastly, Section 4.1 presents the challenges to be overcome to meet these requirements.
3 KNOWLEDGE-DRIVEN DATA ECOSYSTEMS: OVERVIEW & ARCHITECTURE
The literature defines DEs in different ways. For instance, Oliveira and Farias Lóscio [27] define
Data Ecosystems as a “set of networks composed of autonomous actors, which consume, produce, or provide data or other related resources.” Other definitions add that the results created by
the consumption and processing of the data should return to the ecosystem [30]. In [3, 29], the
emergence of the concept of DEs is traced and taxonomically situated among related concepts such
as Business Ecosystems, Digital Ecosystems, and Platform Ecosystems [16].
Cappiello et al. [6] synthesized the following comprehensive definition of a data ecosystem DE
as a 4-tuple DE=⟨Data Sets, Data Operators, Meta-Data, Mappings⟩ where:
• Data sets can be structured or unstructured, can have different formats, e.g., CSV, JSON or
tabular relations, and can be managed using different management systems.
• Data Operators are functions used for accessing or managing data in the data sets.
• Meta-Data provides means for describing the DE context domain and can be used to specify the
meaning of data and associated data operations. It comprises i) Domain ontology, providing
a coherent and unified view of concepts, relationships, and constraints of the domain of
knowledge, associating formal semantics with the elements of the domain. If appropriate,
several ontologies for different portions of the domain can be devised. ii) Properties that
enable the definition of qualitative aspects for the elements of the ecosystem, such as quality
and provenance requirements for data sets and operations. iii) Descriptions to associate
Fig. 1. A Network of Data Ecosystems Empowered with Strategy and Business Models and Regulations.
annotations to the elements of the system for explaining relevant characteristics of data sets
and operations. No specific formal language or vocabulary is required in descriptions.
• Mappings express correspondences among the different components of the data ecosystem.
The mappings are as follows: i) Mappings among ontologies to represent associations among
concepts in different ontologies constituting the domain ontology of the ecosystem. ii) Mappings between data sets and ontology to represent relations among the data in the DE data
sets and the domain ontology, to allow for their interpretation in terms of the ontology.
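The following sketch renders the 4-tuple as Python dataclasses to make its constituents and their relationships explicit; it is one possible encoding of our own, not a normative schema, and all concrete representation choices (e.g., expressing a mapping in a language such as R2RML) are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class MetaData:
    domain_ontologies: list[str]   # e.g., IRIs of the domain ontologies
    properties: dict[str, Any]     # quality/provenance requirements
    descriptions: dict[str, str]   # free-text annotations of DE elements

@dataclass
class Mapping:
    kind: str        # "ontology-ontology" or "dataset-ontology"
    source: str      # identifier of the mapped element
    target: str      # identifier of the mapping target
    expression: str  # the correspondence, e.g., in a language such as R2RML

@dataclass
class DataEcosystem:
    """Schematic rendering of DE = <Data Sets, Data Operators,
    Meta-Data, Mappings> as defined above."""
    data_sets: dict[str, Any]                      # CSV, JSON, relations, ...
    data_operators: dict[str, Callable[..., Any]]  # access/management functions
    meta_data: MetaData
    mappings: list[Mapping] = field(default_factory=list)
```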
Data ecosystems can be further empowered with services that exploit the knowledge encoded
in the meta-data and operators to satisfy business requirements, such as data transparency and
traceability. We name these knowledge-driven data ecosystems. Services include query processing, data transformation, anonymization, data quality assessment, and mapping generation. The following are examples of notable services:
• Concept or mapping discovery: identify a new concept or a new mapping using inductive reasoning and techniques from schema matching, taking into account aspects of uncertainty [15]. Based on the result, the domain ontology and the mappings can be augmented (a toy name-based matcher is sketched after this list).
• Data set curation: identify the best way to keep humans in the loop in order to create a curated
version of a data set in a DE (see [1] for limitations of humans in matching). Services can
also update the properties of a DE to indicate the provenance of the new curated data sets
and manage newly generated data from data transformation, analysis, and learning.
• Procedure synthesis: construct new procedures out of data operators and other building blocks
by composing existing services towards new goals. In complex and evolving systems, it is
infeasible to program procedures and even queries without automatic support. Also, exploring
repositories and libraries of existing procedures should be possible.
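As a toy illustration of the mapping-discovery service above, the following sketch (our own; all schema names are hypothetical) proposes attribute correspondences from plain name similarity. Real matchers combine many such signals and explicitly model the resulting uncertainty [15].

```python
from difflib import SequenceMatcher

def discover_mappings(attrs_a: list[str], attrs_b: list[str],
                      threshold: float = 0.8) -> list[tuple[str, str, float]]:
    """Naive name-based matcher: propose attribute pairs whose string
    similarity exceeds a threshold."""
    proposals = []
    for a in attrs_a:
        for b in attrs_b:
            score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if score >= threshold:
                proposals.append((a, b, round(score, 2)))
    return proposals

# Hypothetical schemas from two clinical data sets:
print(discover_mappings(["patient_id", "blood_pressure"],
                        ["PatientID", "bloodpressure"]))
# [('patient_id', 'PatientID', 0.95), ('blood_pressure', 'bloodpressure', 0.96)]
```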
In our running example, stakeholders like clinics, insurance companies, and researchers can create
their own knowledge-driven DE. Each DE comprises data sets and programs for accessing, managing,
and analyzing their data. Interoperability issues across data sets of a DE are solved in a unified view
represented in the DE ontology. Mappings between the data sets and the DE ontology describe the
meaning of the data sets. Moreover, the description of the data operators enhances data transparency
and provides the basis for tracking down the computational methods executed against a DE.
To enable collaboration across various knowledge-driven DEs and cope with complex scenarios,
a set of DEs can be connected in a network. For this, we envision an ecosystem-wide meta-data
layer where the entire ecosystem is described. Figure 1 depicts a network where nodes and edges
correspond to knowledge-driven DEs and mappings among them, respectively. In this configuration,
the meta-data layer describes each of the nodes (i.e., DEs) in terms of descriptions, properties, and domain ontologies. The following mappings can be defined among the DEs of a network (a schematic encoding is sketched after this list):
• Mappings between domain ontologies: state correspondences among the domain ontologies of
two nodes or between one node and the global meta-data layer.
• Mappings between properties: describe relationships among properties in two nodes. For
example, the provenance of two curated versions of a data set could be the same.
• Data set Mappings: represent correspondences among data sets of different nodes.
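The sketch below (our own illustrative encoding; node names and element identifiers are hypothetical) shows how the meta-data layer could keep a registry of these three mapping kinds and answer simple alignment queries over it.

```python
from dataclasses import dataclass

@dataclass
class NetworkMapping:
    kind: str          # "ontology", "property", or "dataset"
    source_node: str   # DE acting as the mapping source
    target_node: str   # DE (or the global meta-data layer) acting as target
    source_elem: str
    target_elem: str

registry = [
    NetworkMapping("ontology", "clinics-DE", "meta-layer",
                   "clinics:Diagnosis", "net:MedicalFinding"),
    NetworkMapping("property", "clinics-DE", "researchers-DE",
                   "clinics:provenance", "research:provenance"),
    NetworkMapping("dataset", "clinics-DE", "insurers-DE",
                   "clinics:lab_results", "insurers:claims"),
]

# The meta-data layer can answer simple alignment queries, e.g., all
# correspondences in which a given node participates:
clinic_links = [m for m in registry
                if "clinics-DE" in (m.source_node, m.target_node)]
print(len(clinic_links))  # 3
```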
Finally, knowledge-driven DEs can be enhanced with additional meta-data layers to enable the
description of business strategies and access regulations. Figure 1 depicts the main components
of a network of knowledge-driven DEs empowered with these layers. As can be observed, this
enriched version of a network of DEs comprises: (1) Ontological formalisms or causality models that
enable the description of the relationships between the data sets received as input and produced
as output for the services or operators of a DE. (2) Meta-data describing business strategies that
enable the definition of the stakeholders of the network and their roles. (3) Objectives to be met
and the dependencies among the tasks that need to be performed to achieve these objectives.
(4) Agreements for data exchange and criteria for trustworthiness. (5) Regulations and licenses for
data access and for data privacy preservation. (6) Services composing services of the nodes of the
network. (7) Services able to monitor and explore decisions taken by services and operators.
A network of knowledge-driven DEs will facilitate controlled data exchange across stakeholders in
the running example. This network can be hosted and maintained by the consortium of stakeholders.
The meta-data layer specifies alignments among the data sets in each DE. It enables the definition
of business models and access regulations to be satisfied when the data from one DE (e.g., the
clinics DE) is transferred to other DEs (e.g., the DE of the insurance companies or the researchers).
Moreover, the formal descriptions of the data sets and operators enhance transparency not only at
an individual DE level (e.g., clinics or insurance companies), but also in the network. Lastly, services
to monitor how exchanged data is processed in the various multi-site clinical studies empower the
network to verify whether legal and ethical regulations are fulfilled.
4 ENABLING DATA TRANSPARENCY IN DATA ECOSYSTEMS
In this section, we discuss the challenges of implementing the three groups of requirements introduced in Section 2.2 in a network of knowledge-driven DEs. In Section 4.2, we evaluate and rate how hard it is to meet these requirements for the different types of DEs defined in Section 3.
4.1 Challenges of Enabling Data Transparency in Data Ecosystems
Data Management Requirements. Sharing heterogeneous data requires guarantees in terms of
data quality and transparency. As regards data quality, all the collaborative entities should assess
data quality using a common set of DQ services.
In fact, in order to get the maximum benefit from participation in a DE, actors should be able to identify, evaluate, and obtain the most suitable data for the intended usage. Starting from facilitating access to available DE data, some existing solutions propose that query processing over heterogeneous data sets rely on a unified interface for overcoming interoperability issues,
usually based on meta-models [20]. A few DEs have been proposed, mainly focusing on data
ingestion and meta-data extraction and management. Exemplary approaches include Constance [17]
and Ontario [13]. Data integration enables the transformation of heterogeneous data sources or
views under a unified access schema [25]. Data integration systems comprise data collection and
curation steps and resort to record linkage, schema matching, mapping, and data fusion to integrate
data from a collection of data sets [9]. At the heart of the data integration realm lies the matching
task [5], in charge of aligning attributes of data sources both at a schema and data level, in order
to enable formal mappings. Numerous algorithmic attempts (matchers) were suggested over the
years for efficient and effective integration (e.g., [7, 11, 22, 24]). Both practitioners and researchers
also discussed data spaces as an appropriate data integration concept for DEs. Data spaces do not
require a common schema and achieve data integration on a semantic level.
Moreover, only a small percentage of data integration systems provide causal explanations to support traceability [34], or query processing methods to navigate these explanations efficiently [32]. Existing rule-based approaches that allow for a declarative specification of data transformation, integrity, and integration represent building blocks for tracking the validity of the
domain constraints in all the data-driven pipeline steps in a knowledge-driven DE. These approaches
include rule-based entity linking (e.g., [33]), mapping-based tools to perform the process of data
integration (e.g., [21]), and declarative languages like SHACL [8] to describe integrity constraints.
Knowledge-driven DEs represent a new paradigm for data integration able to trace and annotate
provenance and causal relations existing during data ingestion, curation, and integration [18].
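As an illustration of how such declarative integrity constraints can be checked in practice, the following sketch validates a small RDF data set with the open-source pyshacl validator; the ex: vocabulary and the consent-form constraint are hypothetical examples of our own, not taken from [8] or the other cited works.

```python
from rdflib import Graph
from pyshacl import validate

# Hypothetical constraint: every ex:Patient must reference at least one
# consent form (in the spirit of L&ER4 in Section 2.2).
shapes_ttl = """
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <http://example.org/> .
ex:PatientShape a sh:NodeShape ;
    sh:targetClass ex:Patient ;
    sh:property [ sh:path ex:consentForm ; sh:minCount 1 ] .
"""

# One compliant and one non-compliant patient record.
data_ttl = """
@prefix ex: <http://example.org/> .
ex:p1 a ex:Patient ; ex:consentForm ex:form42 .
ex:p2 a ex:Patient .
"""

shapes = Graph().parse(data=shapes_ttl, format="turtle")
data = Graph().parse(data=data_ttl, format="turtle")

conforms, _, report_text = validate(data, shacl_graph=shapes)
print(conforms)     # False: ex:p2 lacks a consent form
print(report_text)  # human-readable violation report for documentation
```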
DQ challenges are also related to the data variety, and to the fact that the description and
measurement of data quality is highly subjective [31], especially if data are used for completely
different purposes from the ones they were originally collected for (re-purposing). Algorithms able
to describe and assess the quality of very heterogeneous sources and very different stakeholder
views are required, together with an agreement about DQ assessment standards in the DE. These
standards must include general DQ dimensions but also be derived from the domain at hand to be
accepted by the parties sharing the data. The DQ assessment phase requires the provision of meta-data and rules to support the selection and reuse of data. DQ should also be assessed on derived data.
Challenges here are related to the evaluation of the quality of the outputs of any transformation
(e.g., aggregation, formulas, integration) and to the fact that in some cases quality evaluation could
be performed in a semi-automatic way. Different stakeholders may have different views on these outputs
(e.g., different levels of granularity) and, as a consequence, their integration needs may vary. The
input offered by stakeholders regarding the quality of the integration is therefore needed to tailor
it to their needs. Data transparency is enabled by a combination of accurate meta-data, including
provenance. The facilities to add these to a single, isolated DE are described in Section 3.
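To indicate how such provenance meta-data could be recorded, the sketch below documents a single curation step with the W3C PROV data model via the Python prov package; the identifiers and the choice of library are our own assumptions.

```python
from prov.model import ProvDocument  # W3C PROV-DM implementation

doc = ProvDocument()
doc.add_namespace("ex", "http://example.org/")

raw = doc.entity("ex:raw_sensor_readings")
curated = doc.entity("ex:curated_sensor_readings")
curation = doc.activity("ex:curation_run_1")
clinic = doc.agent("ex:clinics_DE")

doc.used(curation, raw)                  # the activity consumed the raw data
doc.wasGeneratedBy(curated, curation)    # ... and produced the curated set
doc.wasDerivedFrom(curated, raw)         # explicit derivation link
doc.wasAssociatedWith(curation, clinic)  # responsible stakeholder

print(doc.get_provn())  # PROV-N serialization for the meta-data layer
```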
Networks of DEs introduce further complications: i) meta-data descriptions provided for similar
entities by each DE may differ, and may require semantic reconciliation; ii) levels of transparency
provided through provenance may differ across the nodes. A general challenge is therefore to
achieve full transparency, or at least formally characterize the boundaries of what is visible, in
the presence of “black spots” in the global information flow across all nodes in the network.
Specific transparency challenges are also related to the difficulties in tracking all the operations
performed on data. These new tasks also call for novel reasoning services, based on a sort of reverse engineering process, which rewrites data source queries in terms of global schema (or ontology)
expressions. Future technical support for transparency, traceability, and usage policy enforcement
could be developed from distributed ledger technology (e.g., blockchains) and secure multi-party
computation (MPC) services.
Organization-centric Requirements. The enabling of data governance by several collaborating
entities requires the management of different governance models, different levels of data governance
maturity, and a clear attribution of responsibilities. The traceability of data sovereignty requires
process or dependency models which define how data sovereignty has been exercised, e.g., by usage
control. The challenge here is to define the content and granularity of information a data owner needs to know, based on her role and the regulations valid for the domain. A further challenge is to consider data quality assessment and curation, which can add bias to the data and may also lead
to unwanted usage of the data. We need to find business models and regulations which respect the
interests of involved stakeholders, adequately react to unwanted data usage, and increase the trust
between them. A big challenge is how a data ecosystem can ensure that stakeholders comply with data and data processing standards. Usually, all stakeholders implement their own processes in
multiple ways. Enforcing certain standards in a data ecosystem or network of data ecosystems to
ensure data quality and data transparency across heterogeneous processes and their interactions
is challenging. It requires negotiating common standards and developing a model that is feasible for both SMEs and big stakeholders to implement. Stakeholders could get a certification which testifies that
they adhere to the standards. This makes the quality grade of data sources more transparent to
data consumers. As described by Curry and Sheth [10], data ecosystems may vary according to
the degree of interaction between stakeholders, coordination of data exchange, and control over
data sources. This variety also heavily influences data quality management as it may substantially
change the way to assess, monitor, and improve data quality depending on the model.
Legal & Ethical Requirements. The satisfaction of laws, fundamental rights, or ethical guidelines
demands traceability and certification of data-driven pipelines in a DE. The big challenge is to
devise services capable of certifying robustness in data ecosystems according to the national and
international legal norms for data protection and fundamental rights, while safeguarding data
sovereignty. Furthermore, devising formalisms, models, and computational algorithms able to interoperate across various stakeholders represents a grand challenge for ensuring data transparency.
4.2 How can Data Ecosystems Fulfill Data Transparency Requirements?
To conclude our analysis, we assess the three DE architectures presented in Section 3 with respect to their potential for satisfying the requirements and overcoming the challenges towards data transparency. Each requirement is graded following a three-stars scheme: i) ★ means that the requirement is unsatisfied; ii) ★★ indicates that the data ecosystem has the potential to satisfy the requirement, but it is challenging; iii) ★★★ states that the data ecosystem has the potential
to fully satisfy the requirement. Each requirement is also evaluated in terms of three levels of
satisfaction: I) Complete: The requirement is fully achieved. II) Traceable: The results of the
requirement implementation can be traced down. III) Verifiable: The inspection, demonstration,
test, and analysis of the requirement implementation can be verified.
Table 1 summarizes the analysis of the types of DEs in Section 3. We can observe that the baseline
architecture of DEs has the potential to fulfill several requirements (i.e., DMR1-DMR6, OCR4, L&ER2, and L&ER5). However, since these DEs are only equipped with data sets, operators, and
meta-data, it is challenging for them to keep the stakeholders in the loop during the data quality
rating or for assessing the impact of adding new components. For the same reason, these DEs cannot trace or validate the satisfaction of any of the requirements. In contrast, the other two types
of DEs are able to fully satisfy the data management requirements (i.e., DMR1-DMR6). Nevertheless,
requirement traceability and verifiability still remain a challenge because of the multiple problems
of interoperability, data access, and legal regulations imposed by the stakeholders of a knowledge-driven DE or each individual node in a network of DEs. Moreover, a single knowledge-driven DE cannot interact with other DEs or circulate its business models, regulations, or strategies. As a result,
most organization-centric and legal and ethical requirements cannot be satisfied, or if so, it is very
challenging. Lastly, networks of knowledge-driven DEs are equipped with meta-data, services, and
strategic and business models that facilitate the description of each node and the documentation of
Table 1. Three-stars Model for Requirement Satisfaction.

Data Management Requirements
        Data Ecosystem                   Knowledge-Driven DE              Network of Knowledge-Driven DEs
        Complete Traceable Verifiable    Complete Traceable Verifiable    Complete Traceable Verifiable
DMR1    ★★★      ★         ★             ★★★      ★★        ★★           ★★★      ★★        ★★
DMR2    ★★★      ★         ★             ★★★      ★★        ★★           ★★★      ★★        ★★
DMR3    ★★★      ★         ★             ★★★      ★★        ★★           ★★★      ★★        ★★
DMR4    ★★★      ★         ★             ★★★      ★★        ★★           ★★★      ★★        ★★
DMR5    ★★       ★         ★             ★★★      ★★        ★★           ★★★      ★★        ★★
DMR6    ★★       ★         ★             ★★★      ★★        ★★           ★★★      ★★        ★★

Organizational Challenges
        Data Ecosystem                   Knowledge-Driven DE              Network of Knowledge-Driven DEs
        Complete Traceable Verifiable    Complete Traceable Verifiable    Complete Traceable Verifiable
OCR1    ★        ★         ★             ★        ★         ★            ★★★      ★★        ★★
OCR2    ★        ★         ★             ★        ★         ★            ★★★      ★★        ★★
OCR3    ★        ★         ★             ★        ★         ★            ★★★      ★★        ★★
OCR4    ★★★      ★         ★             ★★★      ★★        ★★           ★★★      ★★        ★★
OCR5    ★        ★         ★             ★        ★         ★            ★★★      ★★        ★★

Legal & Ethical Challenges
        Data Ecosystem                   Knowledge-Driven DE              Network of Knowledge-Driven DEs
        Complete Traceable Verifiable    Complete Traceable Verifiable    Complete Traceable Verifiable
L&ER1   ★        ★         ★             ★        ★         ★            ★★★      ★★        ★★
L&ER2   ★★       ★         ★             ★★       ★★        ★★           ★★       ★★        ★★
L&ER3   ★        ★         ★             ★        ★         ★            ★★       ★★        ★★
L&ER4   ★        ★         ★             ★        ★         ★            ★★       ★★        ★★
L&ER5   ★★       ★         ★             ★★       ★★        ★★           ★★       ★★        ★★
L&ER6   ★        ★         ★             ★        ★         ★            ★★       ★★        ★★
L&ER7   ★        ★         ★             ★        ★         ★            ★★       ★★        ★★
the negotiations required for exchanging data across the network. Thus, although traceability and verifiability remain challenging, these DEs are the only ones furnished with components to enable data transparency.
We hope that this analysis contributes to the understanding of data transparency challenges. We
also aim to encourage the research community to develop trustable networks of knowledge-driven
DEs, thus enabling DQ management, data governance, and sovereignty, as well as mechanisms to trace and verify requirement fulfillment.
5 CONCLUSION
In this work, we have tackled the challenges that DEs face on their way to becoming “smarter,”
equipped with a knowledge layer. In particular, we focused on data quality and data transparency
challenges. Using the motivating example of multi-site clinical studies, we have outlined six data
management requirements, five organizational-centric requirements, and seven legal and ethical
requirements. We then presented a specific architecture from which data quality challenges were
derived and discussed. Table 1 summarizes the discussion by presenting, for each of the three types of DEs, to what extent each of the requirements can be completed, traced, and verified.
With the increasing need for integrated data sets and infrastructures to support DEs, we expect
their impact on organizations to increase. As data quality in general, and data transparency in particular, become significant issues in data management, we hope this work offers a guideline for
researchers and practitioners when investigating developments of knowledge-driven DEs.
ACKNOWLEDGEMENTS
The authors are grateful to the Dagstuhl team for hosting us in September 2019 (Dagstuhl Seminar
19391). Initial ideas that serve as a basis for this paper originated and were discussed there. Gal also
acknowledges the support of the Benjamin and Florence Free Chair. Lenzerini acknowledges the
support of MUR-PRIN project “HOPE”, grant n. 2017MMJJRE, and of EU under the H2020-EU.2.1.1
project TAILOR, grant id. 952215. Vidal acknowledges the support of the EU H2020 project iASiS,
grant id. 727658 and CLARIFY grant id. 875160. Geisler acknowledges the support of the German
Innovation Fund project SALUS, grant id. 01NVF18002. This work has also been supported by the
German Federal Ministry of Education and Research (BMBF) in the context of the InDaSpacePlus
project (grant id. 01IS17031), Fraunhofer Cluster of Excellence "Cognitive Internet Technologies"
(CCIT) and by the Deutsche Forschungsgemeinschaft (DFG) under Germany’s Excellence Strategy EXC-2023 Internet of Production - 390621612. Pernici acknowledges the support of the EU H2020
Crowd4SDG project, grant id 872944.
REFERENCES
[1] Rakefet Ackerman, Avigdor Gal, Tomer Sagi, and Roee Shraga. A cognitive model of human bias in matching. In Abhaya C. Nayak and Alok Sharma, editors, PRICAI 2019: Trends in Artificial Intelligence, pages 632–646, Cham, 2019. Springer International Publishing. ISBN 978-3-030-29908-8.
[2] Sebastian R. Bader, Maria Maleshkova, and Steffen Lohmann. Structuring reference architectures for the industrial Internet of Things. Future Internet, 11(7):151, 2019.
[3] Martina Barbero, Arne Berre, Davide dalle Carbonare, ..., and Walter Weigel. Towards a European-governed data sharing space. https://www.bdva.eu/sites/default/files/BDVADataSharingSpacesPositionPaperV2_2020_Final.pdf, November 2020.
[4] Carlo Batini and Monica Scannapieco. Data and Information Quality - Dimensions, Principles and Techniques. Data-Centric Systems and Applications. Springer, Cham, 2016. ISBN 978-3-319-24104-3. doi: 10.1007/978-3-319-24106-7. URL https://doi.org/10.1007/978-3-319-24106-7.
[5] Zohra Bellahsene, Angela Bonifati, and Erhard Rahm, editors. Schema Matching and Mapping. Data-Centric Systems and Applications. Springer, Berlin Heidelberg, 2011. ISBN 978-3-642-16517-7. doi: 10.1007/978-3-642-16518-4. URL http://dx.doi.org/10.1007/978-3-642-16518-4.
[6] Cinzia Cappiello, Avigdor Gal, Matthias Jarke, and Jakob Rehof. Data Ecosystems: Sovereign Data Exchange among Organizations (Dagstuhl Seminar 19391). Dagstuhl Reports, 9(9):66–134, 2020. ISSN 2192-5283. doi: 10.4230/DagRep.9.9.66. URL https://drops.dagstuhl.de/opus/volltexte/2020/11845.
[7] Chen Chen, Behzad Golshan, Alon Y Halevy, Wang-Chiew Tan, and AnHai Doan. BigGorilla: An Open-Source Ecosystem for Data Preparation and Integration. IEEE Data Eng. Bull., 41(2):10–22, 2018.
[8] Julien Corman, Fernando Florenzano, Juan L Reutter, and Ognjen Savković. Validating SHACL constraints over a SPARQL endpoint. In International Semantic Web Conference, pages 145–163, Cham, 2019. Springer.
[9] Federico Croce, Gianluca Cima, Maurizio Lenzerini, and Tiziana Catarci. Ontology-based explanation of classifiers. In Proceedings of the Workshops of the EDBT/ICDT 2020 Joint Conference, pages 1–5. CEUR-WS, 2020.
[10] E. Curry and A. Sheth. Next-generation smart environments: From system of systems to data ecosystems. IEEE Intelligent Systems, 33(3):69–76, 2018. doi: 10.1109/MIS.2018.033001418.
[11] H. H. Do and E. Rahm. COMA: a system for flexible combination of schema matching approaches. In Proceedings of VLDB, pages 610–621. VLDB Endowment, 2002.
[12] Frederika Welle Donker and Bastiaan van Loenen. How to assess the success of the open data ecosystem? International Journal of Digital Earth, 10(3):284–306, 2017. doi: 10.1080/17538947.2016.1224938.
[13] Kemele M. Endris, Philipp D. Rohde, Maria-Esther Vidal, and Sören Auer. Ontario: Federated query processing against a semantic data lake. In Database and Expert Systems Applications - 30th International Conference, DEXA 2019, Linz, Austria, August 26-29, 2019, Proceedings, Part I, pages 379–395. Springer, 2019.
[14] EU2018. Ethics guidelines for trustworthy AI. https://ec.europa.eu/digital-single-market/en/news/ethics-guidelines-trustworthy-ai, 2018.
[15] Avigdor Gal. Uncertain schema matching. Synthesis Lectures on Data Management, 3(1):1–97, 2011.
[16] Tobias Moritz Guggenberger, Frederik Möller, Tim Haarhaus, Inan Gür, and Boris Otto. Ecosystem types in information systems. In Twenty-Eighth European Conference on Information Systems (ECIS 2020), pages 1–21. Association for Information Systems, 2020.
[17] Rihan Hai, Sandra Geisler, and Christoph Quix. Constance: An intelligent data lake system. In Proc. of the 2016 International Conference on Management of Data, SIGMOD, San Francisco, USA, pages 2097–2100. ACM, 2016. doi: 10.1145/2882903.2899389.
[18] Pieter Heyvaert, Ben De Meester, Anastasia Dimou, and Ruben Verborgh. Rule-driven inconsistency resolution for knowledge graph generation rules. Semantic Web, 10(6):1071–1086, 2019.
[19] Matthias Jarke. Data sovereignty and the Internet of Production. In Advanced Information Systems Engineering, 32nd International Conference, Grenoble, pages 549–558, Cham, 2020. Springer.
[20] Manfred A. Jeusfeld, Matthias Jarke, and John Mylopoulos. Metamodeling for Method Engineering. MIT Press, 2010.
[21] Samaneh Jozashoori, David Chaves-Fraga, Enrique Iglesias, Maria-Esther Vidal, and Óscar Corcho. FunMap: Efficient execution of functional mappings for knowledge graph creation. In The International Semantic Web Conference, pages 276–293, Cham, 2020.
[22] B. Kenig and A. Gal. MFIBlocks: An effective blocking algorithm for entity resolution. Information Systems, 38(6):908–926, September 2013.
[23] F. Kitsios, N. Papachristos, and M. Kamariotou. Business models for open data ecosystem: Challenges and motivations for entrepreneurship and innovation. In 2017 IEEE 19th Conference on Business Informatics (CBI), pages 398–407. IEEE, 2017. doi: 10.1109/CBI.2017.51.
[24] Pradap Konda, Sanjib Das, Paul Suganthan GC, AnHai Doan, Adel Ardalan, Jeffrey R Ballard, Han Li, Fatemah Panahi, Haojun Zhang, Jeff Naughton, et al. Magellan: Toward building entity matching management systems. Proceedings of the VLDB Endowment, 9(12):1197–1208, 2016.
[25] Maurizio Lenzerini. Data integration: A theoretical perspective. In Proceedings of the Twenty-first ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 233–246. ACM, 2002.
[26] I. Lopez de Vallejo, S. Scerri, and T. Tuikka. Towards a European-governed data sharing space. Technical report, BDVA, Brussels, 2020.
[27] Marcelo Iury S. Oliveira and Bernadette Farias Lóscio. What is a data ecosystem? In Proceedings of the 19th Annual International Conference on Digital Government Research: Governance in the Data Age, DG.O 2018, Delft, The Netherlands, May 30 - June 01, 2018, pages 74:1–74:9. ACM, 2018.
[28] Boris Otto and Matthias Jarke. Designing a multi-sided data platform: findings from the international data spaces case. Electronic Markets, 29(4):561–580, October 2019. URL https://doi.org/10.1007/s12525-019-00362-x.
[29] Boris Otto, Dominik Lis, Jan Jürjens, Jan Cirullies, Sebastian Opriel, Falk Howar, Sven Meister, Markus Spiekermann, Heinrich Pettenpohl, and Frederik Möller. Data ecosystems - conceptual foundations, constituents and recommendations for action. Technical report, Fraunhofer ISST, 2019.
[30] Rufus Pollock. Building the (open) data ecosystem. https://blog.okfn.org/2011/03/31/building-the-open-data-ecosystem/, 2011.
[31] Thomas C. Redman. Data Quality: The Field Guide. Digital Press, USA, 2001. ISBN 1555582516.
[32] Theodoros Rekatsinas, Sudeepa Roy, Manasi Vartak, Ce Zhang, and Neoklis Polyzotis. Opportunities for data management research in the era of horizontal AI/ML. Proc. VLDB Endow., 12(12):2323–2324, 2019.
[33] Ahmad Sakor, Kuldeep Singh, Anery Patel, and Maria-Esther Vidal. Falcon 2.0: An entity and relation linking tool over Wikidata. In CIKM '20: The 29th ACM International Conference on Information and Knowledge Management, pages 3141–3148. ACM, 2020.
[34] Xiaolan Wang, Laura M. Haas, and Alexandra Meliou. Explaining data integration. IEEE Data Eng. Bull., 41(2):47–58, 2018.
[35] Ruojing Zhang, Marta Indulska, and Shazia W. Sadiq. Discovering data quality problems - the case of repurposed data. Bus. Inf. Syst. Eng., 61(5):575–593, 2019. doi: 10.1007/s12599-019-00608-0. URL https://doi.org/10.1007/s12599-019-00608-0.