EABlock: A Declarative Entity Alignment Block for Knowledge
Graph Creation Pipelines
Samaneh Jozashoori
Ahmad Sakor
TIB - Leibniz Information Centre
for Science and Technology
Leibniz University of Hannover
samaneh.jozashoori@tib.eu
TIB - Leibniz Information Centre
for Science and Technology
Leibniz University of Hannover
ahmad.sakor@tib.eu
Enrique Iglesias
Maria-Esther Vidal
L3S Research Center
Leibniz University of Hannover
iglesias@l3s.de
TIB - Leibniz Information Center
for Science and Technology
Leibniz University of Hannover
L3S Research Center
maria.vidal@tib.eu
ABSTRACT
Despite encoding an enormous amount of rich and valuable data, existing data sources are mostly created independently, which poses a significant challenge to their integration. Mapping languages, e.g., RML
and R2RML, facilitate the declarative specification of the process of
applying meta-data and integrating data into a knowledge graph.
Mapping rules can also include knowledge extraction functions in
addition to expressing correspondences among data sources and a
unified schema. Combining mapping rules and functions represents
a powerful formalism to specify pipelines for integrating data into
a knowledge graph transparently. Surprisingly, these formalisms
are not fully adopted, and many knowledge graphs are created by
executing ad-hoc programs to pre-process and integrate data. In this
paper, we present EABlock, an approach integrating Entity Alignment (EA) as part of RML mapping rules. EABlock includes a block
of functions that perform entity recognition from textual attributes
and link the recognized entities to the corresponding resources
in Wikidata, DBpedia, and domain-specific thesauri, e.g., UMLS.
EABlock provides agnostic and efficient techniques to evaluate the
functions and transform the mappings to facilitate its application in
any RML-compliant engine. We have empirically evaluated EABlock
performance, and results indicate that EABlock speeds up knowledge graph creation pipelines that require entity recognition and
linking in state-of-the-art RML-compliant engines. EABlock is also
publicly available as a tool through a GitHub repository and a DOI.
KEYWORDS
Knowledge Graph Creation; Semantic Data Integration; Entity
Alignment; Mapping Rules; Functional Mappings
ACM Reference Format:
Samaneh Jozashoori, Ahmad Sakor, Enrique Iglesias, and Maria-Esther
Vidal. 2022. EABlock: A Declarative Entity Alignment Block for Knowledge
Graph Creation Pipelines. In The 37th ACM/SIGAPP Symposium on Applied
Computing (SAC ’22), April 25–29, 2022, Virtual Event. ACM, New York, NY,
USA, 9 pages. https://doi.org/10.1145/3477314.3507132
1 INTRODUCTION
Knowledge graphs (KGs) represent the convergence of data
and knowledge using networks. Although the term was coined by the research community several decades ago, KGs are playing an increasingly relevant
role in scientific and industrial areas [12]. Years of research on semantic data management and knowledge engineering have paved
the way for the integration of factual statements spread across
various data sources or collected from community-maintained data
sources (e.g., Wikidata [22] and DBpedia [1]). The rich spectrum of
knowledge represented in existing KGs positions them as sources
of background knowledge to empower data-driven processes. Nevertheless, real-world applications require accountable methods to
facilitate the traceability of data management processes performed
to integrate data into a KG. Thus, KG management needs to be
enriched with transparent methods to understand and validate the
steps performed to transform disparate data into a unified KG.
Data integration systems (DIS) [17] represent generic frameworks
to define a KG in terms of a unified schema, a set of data sources,
and mapping rules between concepts in the unified schema and the
sources. The declarative definition of mapping languages represents
a building block for tracking down a KG creation; it also facilitates
reusability and modularity. Mapping languages (e.g., R2RML [5] and
RML [10]) have been proposed as standards describing correspondences between the concepts in the unified schema (e.g., classes,
properties, and relations) and the data sources’ attributes. Thus, by
following the global as view (GAV) paradigm [17] where concepts
in the unified schema are defined in terms of the sources, they enable the resolution of interoperability conflicts among data sources
CCS CONCEPTS
· Information systems → Resource Description Framework
(RDF); Information extraction.
SAC ’22, April 25–29, 2022, Virtual Event
© 2022 Association for Computing Machinery.
ACM ISBN 978-1-4503-8713-2/22/04...$15.00
https://doi.org/10.1145/3477314.3507132
defined using different schemas. However, data sources may have
diverse levels of structuredness (e.g., structured, semi-structured,
and unstructured), suffer from data quality issues, or present several interpretations of the same real-world entity. The resolution of
these conflicts as part of the process of KG creation can be defined
as Data Operators in a Data Ecosystem (DE) proposed by Capiello
et al. [3]. Alternatively, mapping languages have been extended
to embrace Data Operators as functions that can be included as
programming scripts directly in the mapping rules [8, 16, 23] or
can follow a declarative approach (e.g., using the Function Ontology, FnO) [6, 7]. They offer clear benefits in comparison to ad-hoc
pre- and post-processing techniques in terms of reusability and
reproducibility. Nonetheless, the lack of generic frameworks to
deal with mapping rules and functions complicates mapping rule
design because these functions also need to be implemented.
Our Method: We address the problem of EA using target knowledge
to solve interoperability conflicts across data sources by proposing a method named EABlock. EABlock is a computational block
composed of a set of FnO functions, which can be called from RML
mapping rules, and an efficient strategy to evaluate them. The functions in EABlock are tuned to effectively align entities in a KG with
their corresponding entities in existing KGs (e.g., DBpedia [1] and
Wikidata [22]) and controlled vocabularies (e.g., UMLS [2]). These
functions resort to another engine for solving the tasks of named
entity recognition (NER) and entity linking (EL) required for EA;
any engine performing NER and EL tasks can be utilized. EABlock
follows an eager evaluation strategy and enables the execution of
the EABlock functions before the RML mapping rules are executed.
This evaluation strategy, defined by Jozashoori et al. [15], facilitates the transformation of RML with the EABlock functions into
function-free RML mapping rules that can be executed by any RML-compliant engine without requiring any modification in the engine.
EABlock has been developed and experimentally evaluated on real-world datasets collected from DBpedia, Wikidata, and UMLS. The
observed outcomes suggest that EABlock functions perform EA
to domain-specific and encyclopedic KGs effectively. EABlock is
utilized in three international projects to integrate data into the
KGs developed in these projects. The results corroborate the role
that declaratively defined functions have in KG management.
This paper is structured in six additional sections. Section 2 summarizes the state of the art, and section 3 motivates and defines
the problem addressed by EABlock. While section 4 provides an
overview of background knowledge and preliminary concepts, section 5 formally defines the problem and describes EABlock as the
solution, including its proposed strategy and techniques. In section
6 the results of the experimental study are reported and explained.
Lastly, section 7 wraps up and outlines the future work.
as part of pre-processing can add a considerable overhead on the
knowledge graph creation pipeline. Additionally, pre-processing
steps are usually developed as ad-hoc programs, which are neither
declarative nor easy to maintain. SemTab¹ is an effort to benchmark systems dealing with the tabular data to KG matching
problem and to present existing challenges [14]. An alternative is to
perform EA after the creation of KG, at the expense of creating
the same nodes multiple times across different KGs. Zeng [25] provides a comprehensive survey of available techniques to add EA in
post-processing, to find the equivalent entities in different created
KGs. Lastly, EA can be part of the main pipeline of semantic data
integration and knowledge graph creation applying transformation functions. In other words, EA can be involved in the mapping
rules that enrich raw data semantically and transform them into
the RDF model. In this case, EA needs to be defined as a transformation function in the mapping rules. There are different mapping
languages enabling the involvement of functions as part of the
mapping rules, such as RML+FnO [6, 7], R2RML-F [8], FunUL [16],
and D-REPR [23]. There also exist different engines capable of processing functions in different languages. For instance, FunMap [15]
is capable of translating function-based mappings in RML+FnO into
equivalent function-free mappings in RML efficiently. In spite of all
the value that declarative mapping languages and corresponding
techniques provide, their potential applications in the task of EA
are neither well explored nor appreciated. Therefore, we aim to fill
this gap by enabling and facilitating the application of EA tools as
part of mapping rules using transformation functions.
3 MOTIVATING EXAMPLE
We motivate our work with a mock example from a real-world
scenario illustrated in Figure 1. In this scenario, the aim is to integrate four datasets obtained from different sources into a KG. The
datasets consist of a) patient data extracted from two different clinical notes provided by a general practitioner (GP) and an oncologist,
including the comorbidities from which the patient is suffering, and
b) drug-related data extracted from DrugBank², including drug-drug interaction data, which provide information on the possible interactions
between different drugs and their impact on the effectiveness of each,
and drug-disorder data, which reveal the list of drugs that
can be prescribed for each disorder. A portion of the KG created
by a naive approach can be observed in Figure 1. A closer look
reveals that the same disorder instance exists as three separate
nodes in the graph, i.e., there is an interoperability conflict among
them. The existing interoperability issue can be traced back to the
raw data, where I. the same disorder is represented with different
names by clinical physiologists, and II. the name of the disorder
is misspelled in one of the records. Another important point is
regarding the connection between the instances of the generated
KG and instances in available domain-specific sources (e.g., UMLS)
or encyclopedic KGs (e.g., DBpedia and Wikidata) which represent
the same real-world entities. More specifically, the importance of
these connections becomes apparent when integrating or linking
other available data/knowledge bases that are annotated with instances of such sources (i.e., UMLS, DBpedia, and Wikidata). Both
2 RELATED WORK
Entity Alignment (EA) is an important solution to overcome interoperability issues while creating a knowledge graph from heterogeneous data sources. Dimou et al. [9], Michel et al. [18], and Vidal et
al. [21] propose EA as a pre-processing step, prior to the semantic
enrichment and integration of data. In this case, pre-processing
performs the task of EA on the whole provided data sources, independent of their involvement in the goal KG. Hence, including EA
1 https://www.cs.ox.ac.uk/isg/challenges/sem-tab/
2 https://go.drugbank.com/
Figure 1: Motivating Example. Data integration from four datasets. Different interoperability issues: concept disorder is modeled
differently in Dataset 1 and 2. The entity hypertension is represented with various entities, and its name is misspelled in
Dataset 1. Performing EA with UMLS, DBpedia, or Wikidata, enables conflict resolution and integration into a KG.
observations emphasize the importance of including EA as a module in the pipeline of KG creation. It should be noted that following
FAIR principles [24], transparency and reproducibility are essential requirements in pipelines of KG creation. All blocks applied as
part of the main process or pre- or post-processing of KG creation
should be transparent and traceable. This leads to thinking about
an independent transparent module for entity alignment, using
a declarative language that can be integrated in any KG creation
pipeline that is compliant with the same mapping language.
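The conflict described in this section can be illustrated with a minimal Python sketch. The three disorder labels, the `naive_iri` and `aligned_iri` helpers, the lookup table, the base IRI, and the `CUI_HYPERTENSION` placeholder are all hypothetical; in a real pipeline the lookup would be answered by a NER/EL engine rather than by a hard-coded dictionary.

```python
from urllib.parse import quote

# Hypothetical raw values for the same disorder as they might appear in
# different clinical notes (one synonym and one misspelling).
raw_labels = ["hypertension", "high blood pressure", "hypertention"]

# Hypothetical stand-in for an entity linking service: label -> identifier.
# CUI_HYPERTENSION is a placeholder, not a real UMLS identifier.
LINKS = {
    "hypertension": "CUI_HYPERTENSION",
    "high blood pressure": "CUI_HYPERTENSION",
    "hypertention": "CUI_HYPERTENSION",
}

BASE = "http://example.org/disorder/"  # illustrative namespace


def naive_iri(label: str) -> str:
    """Build a node IRI directly from the raw label (no alignment)."""
    return BASE + quote(label.strip().lower())


def aligned_iri(label: str) -> str:
    """Build a node IRI from the linked identifier when one is known."""
    linked = LINKS.get(label.strip().lower())
    return BASE + linked if linked else naive_iri(label)


# The naive approach yields one node per distinct spelling,
# while the aligned approach collapses them into a single node.
naive_nodes = {naive_iri(label) for label in raw_labels}
aligned_nodes = {aligned_iri(label) for label in raw_labels}
```

Under these assumptions, `naive_nodes` holds three distinct IRIs while `aligned_nodes` holds one, which mirrors the three separate disorder nodes discussed for Figure 1.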
that are modeled using different schemas, e.g., different attributes
representing the same concept. Domain: this interoperability conflict occurs among various interpretations of the same entity. They
include: i) homonym: the same name is used to represent concepts with different meaning, and ii) synonym: distinct names are
used to model the same concept. Figure 1 illustrates the interoperability issues: structuredness between the two data sources of
drug-drug interactions; schematic among the attributes of dataset
1 and 2; and domain among the names representing hypertension. In general, KG creation pipelines include an additional
pre- / post-processing block to solve the interoperability issues
between data sources. However, this block can be part of the DE
as one of the 𝐷𝑎𝑡𝑎𝑂𝑝𝑒𝑟𝑎𝑡𝑜𝑟𝑠 utilizing the knowledge encoded in the 𝑀𝑒𝑡𝑎-𝐷𝑎𝑡𝑎 [11]. KGs are expressed in the Resource Description Framework (RDF), where nodes can be resources or literals, and edges
correspond to predicates. RDF resources are identified by IRIs (Internationalized Resource Identifier) or blank nodes (anonymous
resources or existential variables), while literals correspond to instances of a data type (e.g., numbers, strings, or dates). Mapping
rules in 𝐷𝐸 are declaratively defined using the RDF Mapping Language (RML), an extension of the W3C-standard mapping language
R2RML. RML allows for the definition of sources in different formats (e.g., CSV, Relational, JSON, and XML). An RML mapping
rule, named TriplesMap, follows the global as view paradigm [17],
i.e., concepts in the unified schema are defined in terms of a data
source. Figure 2 presents RML TriplesMaps. A rr:subjectMap defines the resources of an RDF class in the unified schema, while a
set of predicate-object maps (rr:predicateObjectMap) define the
properties and relations of a class. The values of a predicate-object
map can be defined in terms of a data source attribute, or as a reference or a join with the rr:subjectMap in another TriplesMap. A
reference to another triples map is denoted as rr:RefObjectMap; it
can be stated only between triples maps defined over the same data
source. Lastly, a rr:JoinCondition represents references between
TriplesMap defined on different data sources. A function can define
rr:subjectMap or rr:predicateObjectMap. The Function Ontology (FnO) is used to specify functions of the type FunctionMap [6,
4 PRELIMINARIES
Knowledge graphs (KGs) are data structures that represent factual
knowledge as entities and their relationships using a graph data
model [12]. A KG is a directed graph 𝐺=(𝑂,𝑉 ,𝐸), where: 𝑂 is a
unified schema that comprises classes, properties, and relations.
𝑉 is a set of nodes in the KG; nodes in V correspond to classes
or instances of classes in O. 𝐸 is a set of directed labeled edges
in the KG that relate nodes in 𝑉 . Edges are labeled with properties and relations in 𝑂. A KG creation process can be specified in
terms of a Data Ecosystem (DE). A DE [3] is defined as a 4-tuple
𝐷𝐸 = ⟨𝐷𝑎𝑡𝑎𝑆𝑒𝑡𝑠, 𝐷𝑎𝑡𝑎𝑂𝑝𝑒𝑟𝑎𝑡𝑜𝑟𝑠, 𝑀𝑒𝑡𝑎-𝐷𝑎𝑡𝑎, 𝑀𝑎𝑝𝑝𝑖𝑛𝑔𝑠⟩ where
𝐷𝑎𝑡𝑎𝑂𝑝𝑒𝑟𝑎𝑡𝑜𝑟𝑠 represent a set of operators that can be executed
over data in 𝐷𝑎𝑡𝑎𝑆𝑒𝑡𝑠, which include structured or unstructured data sets. 𝑀𝑒𝑡𝑎-𝐷𝑎𝑡𝑎 describes the domain of knowledge and
the meaning of the data residing in 𝐷𝑎𝑡𝑎𝑆𝑒𝑡𝑠. 𝑀𝑒𝑡𝑎-𝐷𝑎𝑡𝑎
comprises: I. ontologies and controlled vocabularies to provide a
unified view of the domain knowledge; II. properties to describe
the data quality, provenance, and access regulations; and III. descriptions of the main characteristics of data. Finally, 𝑀𝑎𝑝𝑝𝑖𝑛𝑔𝑠
represent the correspondences among the concepts and properties in different domain ontologies or associations between data
in 𝐷𝑎𝑡𝑎𝑆𝑒𝑡𝑠 and the domain ontology. The same real-world entity can be represented differently in the data sources in 𝐷𝑎𝑡𝑎𝑆𝑒𝑡𝑠.
Interoperability issues include: Structuredness: this conflict occurs whenever data sources are described at different levels of
structuredness, e.g., structured, semi-structured, and unstructured.
Schematic: this interoperability conflict exists among data sources
Figure 2: RML+FnO Triples Maps. a) Drug and DBpedia-TriplesMap are RML triples maps (lines 1-23), while DBpedia-Function
is a FnO function (lines 25-35). b) Eager evaluation of FnO functions creates PROJECT1.csv and PROJECT2.csv, and generates
function-free RML maps, which can be executed in any RML-compliant engine without a function configuration to be required.
7]. RML TriplesMaps are executed with three operators [13]: i) Simple Object Map (SOM): the basic operator, which executes a predicate-object map against data source attributes. ii) Object Reference
Map (ORM): this operator evaluates a predicate-object map between
two triples maps defined over the same source. The predicate object
corresponds to an object of the referred TriplesMap. iii) Object
Join Map (OJM): a join condition between two RML TriplesMaps
with different data sources is executed with this operator.
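The three operators can be sketched in Python as follows. This is only an illustration under stated assumptions: the row contents, attribute names, the `http://example.org/` namespace, and the three helper functions are hypothetical, and the hash join used for the OJM case is one common way to evaluate a join condition, not necessarily what any particular RML engine does.

```python
# Hypothetical rows from two sources; attribute names are illustrative only.
drugs = [{"id": "d1", "name": "aspirin", "disorder_id": "x1"}]
disorders = [{"id": "x1", "label": "hypertension"}]

EX = "http://example.org/"  # illustrative namespace


def simple_object_map(rows):
    """SOM: the object is taken from an attribute of the same row."""
    return [(EX + "drug/" + r["id"], EX + "name", r["name"]) for r in rows]


def object_reference_map(rows):
    """ORM: the object is the subject generated by another triples map
    defined over the same source, so the same row is reused without a join."""
    return [
        (EX + "drug/" + r["id"], EX + "treats", EX + "disorder/" + r["disorder_id"])
        for r in rows
    ]


def object_join_map(child_rows, parent_rows, child_attr, parent_attr):
    """OJM: the object is the subject of a triples map over a different
    source; rows are matched on the join condition (hash join here)."""
    index = {}
    for parent in parent_rows:
        index.setdefault(parent[parent_attr], []).append(parent)
    triples = []
    for child in child_rows:
        for parent in index.get(child[child_attr], []):
            triples.append(
                (EX + "drug/" + child["id"], EX + "treats", EX + "disorder/" + parent["id"])
            )
    return triples


som = simple_object_map(drugs)
orm = object_reference_map(drugs)
ojm = object_join_map(drugs, disorders, "disorder_id", "id")
```

In this toy setting the ORM and OJM calls produce the same triple; the difference lies in whether the object can be built from the current row alone or requires matching rows across two sources.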
A 𝐷𝐸 whose 𝑀𝑎𝑝𝑝𝑖𝑛𝑔𝑠 comprise RML TriplesMaps with FnO
functions can be executed following two strategies. a) Lazy evaluation delays the execution of a function until it is needed to
compute a value in a TriplesMap. b) Eager evaluation executes the
functions in 𝑀𝑎𝑝𝑝𝑖𝑛𝑔𝑠 over the data sources before these values
are needed in the RML triples maps. Lazy evaluation requires
an engine that understands both RML and FnO. In contrast, eager evaluation enables the transformation of the RML+FnO TriplesMaps
into function-free RML triples maps. This evaluation can be done
beforehand, and the results can be represented as sources of the
translated function-free TriplesMaps. Another advantage of
eager evaluation is that any RML-compliant engine can be used to
execute the function-free TriplesMaps and create a KG. Figure 2
a) presents two RML TriplesMaps (lines 1-23) and the function
DBpedia-Function, defined in lines 25-35. Following a lazy evaluation, DBpedia-Function is executed each time that a new entity of the class Drug is created. This execution requires that the
RML engine is able to execute functions. Moreover, in the presence
of a large number of duplicates in the data sources (i.e., drug.csv),
the function may be executed several times on the same value. On the other hand, Figure 2 b)
depicts the translation performed for eager evaluation; this approach is described by Jozashoori et al. [15]. The transformed
RML TriplesMaps are evaluated over new data sources. The data
source PROJECT1.csv is created from drug.csv following well-known properties of the relational algebra (e.g., pushing down the
projections and selections into the data sources); in addition to
reducing the size of the data sources, these operations also eliminate duplicates. Furthermore, PROJECT2.csv is created from the
materialization of DBpedia-Function. The reference between the
two TriplesMaps is expressed using a join condition.
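A minimal Python sketch of the eager strategy is given here, assuming one plausible reading of the translation: PROJECT1 holds the distinct values of the projected attribute, and PROJECT2 holds one materialized function result per distinct value. The CSV content, the column names, the `dbpedia_function` stand-in, and the `http://example.org/` namespace are hypothetical and only serve the illustration.

```python
import csv
import io

# Hypothetical content of drug.csv; the column names are assumptions.
DRUG_CSV = "id,name\n1,Aspirin\n2,Aspirin\n3,Ibuprofen\n"

calls = 0  # counts how often the (potentially expensive) function runs


def dbpedia_function(value: str) -> str:
    """Hypothetical stand-in for DBpedia-Function; a real setup would
    call an entity linking service instead of building a string."""
    global calls
    calls += 1
    return "http://dbpedia.org/resource/" + value


rows = list(csv.DictReader(io.StringIO(DRUG_CSV)))

# PROJECT1: keep only the attribute the function needs and drop duplicates.
project1 = list(dict.fromkeys(row["name"] for row in rows))

# PROJECT2: materialize the function once per distinct input value.
project2 = [{"input": v, "output": dbpedia_function(v)} for v in project1]

# Function-free step: a plain join between PROJECT1 and PROJECT2 on the
# input value, which any join-capable engine could evaluate.
lookup = {record["input"]: record["output"] for record in project2}
triples = [
    (
        "http://example.org/drug/" + name,
        "http://www.w3.org/2002/07/owl#sameAs",
        lookup[name],
    )
    for name in project1
]
```

With three input rows and two distinct drug names, the function runs twice here, whereas a lazy evaluation over the raw rows would run it three times.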
5 OUR APPROACH: EABLOCK
Problem Statement: As shown in Figure 1, a KG can comprise several entities that correspond to the same real-world entity (e.g., various entities representing hypertension). We address the problem of efficiently aligning entities in a KG $G_1=(O_1,V_1,E_1)$ with entities in an existing KG $G_2=(O_2,V_2,E_2)$. Encyclopedic KGs like DBpedia [1] or Wikidata [22], or domain-specific ones (e.g., UMLS [2]), correspond to the KGs $G_2$ against which the alignment is performed.
Proposed Solution: Entity alignment from $G_1$ to $G_2$, $\gamma(G_1 \mid G_2)$, is defined in terms of an ideal KG, $G^*=(O^*,V^*,E^*)$, that includes the nodes and edges in $G_1$ and $G_2$ plus all the edges that relate nodes in $G_1$ with nodes in $G_2$. A solution to $\gamma(G_1 \mid G_2)$ corresponds to a maximal partial function $\zeta: V_1 \to V_2$ such that $\gamma(G_1 \mid G_2, \zeta)=\{(s_1, \mathit{sameAs}, \zeta(s_1)) \mid (s_1, \mathit{sameAs}, \zeta(s_1)) \in E^*\}$.³ $DE_{G_{1,2}} = \langle \mathit{DataSets}_1, \mathit{DataOperators}, \mathit{Meta\text{-}Data}_1, \mathit{Mappings}_{1,2} \rangle$ defines the KG $G_{1,2}=(O_1 \cup \{\mathit{sameAs}\}, V_1 \cup V_2, E_1 \cup \gamma(G_1 \mid G_2, \zeta))$. The set $\mathit{Mappings}_{1,2}$ is a superset of $\mathit{Mappings}_1$ including all triples maps that define $\zeta$ and enable the computation of $\gamma(G_1 \mid G_2, \zeta)$.
EABlock is an approach proposing a computational block to solve entity alignment over textual attributes, providing techniques that bridge and utilize all components of a DE, i.e., $\mathit{DataSets}$, $\mathit{DataOperators}$, $\mathit{Meta\text{-}Data}$, and $\mathit{Mappings}$: a) EABlock links entities encoded in labels and short text to controlled vocabularies described by meta-data and to resources in encyclopedic and other domain-specific KGs. For this purpose, EABlock introduces a set of operating functions resorting to an entity and relation linking tool. b) EABlock functions are defined in a human- and machine-readable medium, meeting the requirements of meta-data in terms of transparency and reusability. Although the outcome of EABlock representing the aligned entities and annotations provides meta-data for the KG, the addition of EABlock functions to the meta-data of the DE equips this layer for further reproduction or maintenance of the KG with newly added data. c) EABlock functions can be easily integrated into the mappings expressing the relations among the data and the ontology using the RML language, applying available extensions of the language. d) EABlock also provides an efficient evaluation strategy to materialize the calls of the functions in the mappings, extending data sources and transforming mappings into function-free RML mappings that can be adopted by any RML-compliant KG creation pipeline.
As shown in Figure 3, EABlock comprises three components. Functions includes the signatures of the EABlock functions in FnO. The functions can be divided into two categories based on their domains and ranges. Keyword-based functions receive case-insensitive keywords as input and generate one entity as the output, and Short text-based functions accept a case-insensitive short text as input and output a list of entities. Entity Alignment performs the NER and EL tasks. This component is agnostic, i.e., any tool solving the tasks of Named Entity Recognition (NER) and Entity Linking (EL) through an API can be employed; as a proof of concept, we use Falcon2.0 [20]. Falcon [19, 20] is empowered with background knowledge that allows for the accurate recognition and linking of biomedical concepts. Falcon2.0 relies on background knowledge built from resources and their corresponding labels from diverse KGs (e.g., DBpedia, Wikidata, and UMLS). The labels in the background knowledge are the textual descriptions of the resources, which are connected using the owl:sameAs relation. The background knowledge utilized for the Falcon API in EABlock is a subset of the one described in [20]. The background knowledge is filtered by omitting all the resources that are not related to the biomedical domain. The list that is utilized in the filtering process contains the following resource types: Chemicals & Drugs, Anatomy, Disorders, Living Beings, Organizations, Physiology, and Genes & Molecular Sequences. Applying this filtering to the background knowledge of Falcon reduces the ambiguity among the resources in the EL task and clears the noise that can be generated from unrelated resource types, e.g., street names. Interpreter connects the previous two components. It follows an eager evaluation strategy of the functions and retrieves the results of the entity alignment generated by the entity alignment tool. The eager evaluation strategy gives the basis for an efficient and RML engine-agnostic execution of the EABlock functions. It resorts to the approach proposed by Jozashoori et al. [15] to translate the input RML+FnO TriplesMaps into function-free RML TriplesMaps. As explained before, EABlock creates a new dataset, the output dataset, materializing the functions. The output dataset comprises two attributes: the input and output attributes (attr1 and attr2). Depending on the category of the function, EABlock deploys one of the following two techniques. a. If the function is a Keyword-based function, for each input value, one record is added to the output dataset including the input value and the retrieved linked entity as the values of attr1 and attr2, respectively. b. However, if the function is Short text-based, after evaluating the function and receiving the list of linked entities, EABlock adds one record to the output dataset for each entity in that list; each record includes the input value and the linked entity as the values of attr1 and attr2, respectively. In this way, EABlock ensures that the generated datasets can be translated by any RML-compliant engine and result in exactly the same RDF triples, since different RML engines may have different interpretations of an RDF list.
Figure 3: The EABlock components. A set of FnO functions that resort to an Entity Alignment engine. An Interpreter that executes EABlock functions included in RML+FnO mapping rules and translates these rules into function-free rules.
Implementation and Application: The EABlock approach is implemented and available as a tool. As a proof of concept, EABlock integrates the Falcon2.0 API⁴ to perform the NER and EL tasks. EABlock is developed in Python 3, is open source, and is licensed under the Apache License 2.0. EABlock is publicly accessible through a GitHub repository⁵ and Zenodo⁶.
3 A partial function $\zeta: V_1 \to V_2$ is a function from a subset of $V_1$ to $V_2$. $\zeta$ is maximal in the partially ordered set of all the functions from $V_1 \to V_2$.
4 https://labs.tib.eu/sdm/falconmedical/falcon2/
5 https://github.com/SDM-TIB/EABlock
6 https://doi.org/10.5281/zenodo.5779773
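The two materialization techniques used by the Interpreter (one record per input for Keyword-based functions, one record per linked entity for Short text-based functions) can be sketched in Python as follows. The function names, the `ex:` prefix, and the two fake linkers are hypothetical and stand in for calls to a NER/EL API; de-duplicating the input values is an assumption borrowed from the eager evaluation strategy, not something this section states explicitly.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]


def materialize_keyword(
    values: Iterable[str], link_one: Callable[[str], str]
) -> List[Record]:
    """Keyword-based: one record per distinct input value."""
    records = []
    for value in dict.fromkeys(values):  # assumption: duplicates are dropped
        records.append({"attr1": value, "attr2": link_one(value)})
    return records


def materialize_short_text(
    values: Iterable[str], link_many: Callable[[str], List[str]]
) -> List[Record]:
    """Short text-based: one record per (input value, linked entity) pair,
    so the list of entities is flattened instead of kept as an RDF list."""
    records = []
    for value in dict.fromkeys(values):
        for entity in link_many(value):
            records.append({"attr1": value, "attr2": entity})
    return records


# Hypothetical linkers; a real deployment would query an entity linking API.
def fake_link_one(keyword: str) -> str:
    return "ex:" + keyword.lower()


def fake_link_many(text: str) -> List[str]:
    # Toy rule: treat every word longer than three characters as an entity.
    return ["ex:" + word.lower() for word in text.split() if len(word) > 3]


keyword_out = materialize_keyword(["Aspirin", "Aspirin"], fake_link_one)
short_text_out = materialize_short_text(
    ["aspirin for hypertension"], fake_link_many
)
```

Because every record holds exactly one input value and one entity, the output dataset stays a flat two-column table that a function-free mapping can join against.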
(a) The performance of a KG creation pipeline applying RocketRML.
(b) The performance of a KG creation pipeline applying SDM-RDFizer.
Figure 4: Efficiency. The impact of using EABlock in KG creation pipelines applying two different RML-compliant engines.
Baseline corresponds to the execution of entity alignment in a pre-processing stage, while EABlock enables the specification of
this process in the RML mapping rules. As observed, EABlock reduces the execution time of KG creation pipelines that involve
entity alignment tasks in comparison to the application of the same functions but during a pre-processing stage.
including RocketRML⁸ and SDM-RDFizer⁹. Accordingly, the experiments in one category differ in a. the applied RML-compliant
engine and b. whether EABlock is used as part of the pipeline or not.
Datasets and Mappings. Considering the parameters that affect
the performance of KG creation pipelines [4], we define three different sets of mapping rules, which are distinguished based on the
complexity of the rr:TriplesMaps that refer to the EABlock transformation functions. We manipulate the complexity of these rules by using different numbers of rr:RefObjectMaps, i.e.,
zero, one, or two rr:RefObjectMaps (referred to as noROM, 1ROM,
and 2ROM, respectively, in Figure 4). In an attempt to prevent possible effects of data volume on the results of the experiments, we
generate two relatively small datasets including 1,000 and 2,000 randomly selected records. Each dataset comprises 22 attributes, two
of which are referenced in the mapping rules. Setups. We define
two KG creation pipelines, Baseline and EABlock, which execute the
same entity alignment tasks and produce the same KG. The Baseline
pipeline evaluates RML mapping rules while the entity alignment is
performed in a pre-processing step. In contrast, the EABlock pipeline
encapsulates these tasks in the EABlock functions that are called
in the RML mapping rules. Metrics. Execution time: Elapsed time
spent by the whole pipeline to complete the creation of a KG; it is
measured as the absolute wall-clock system time as reported by the
time command of the Linux operating system. The experiments
were run on a machine with an Intel(R) Xeon(R) CPU E5-2603 v3 @
1.60GHz with 20 cores and 64GB of memory, running Ubuntu 16.04 LTS.
Results. Figure 4 illustrates the performance of the two approaches to
KG creation, i.e., Baseline, which performs EA as pre-processing,
and EABlock, which enables the specification of EA as part of the
RML mapping rules. As can be observed in Figure 4, independent
6 EMPIRICAL EVALUATION
Our goal is to empirically assess the performance of EABlock in
the resolution of the problem presented in section 5. The following
research questions guide our experimental study: RQ1) What is
the impact of applying EABlock in KG creation in terms of execution time? RQ2) How does applying EABlock in the process of KG
creation impact the quality of the resulting KG? RQ3) How sensitive
is EABlock to the quality of input data? As a proof of concept, we
set up the experiments using biomedical data. Accordingly, we rely
on an API of Falcon7 that provides a filtered subset of the background knowledge [20] omitting the resources that are not related
to the biomedical domain. A list of related resource types is utilized
for filtering the background knowledge. The list contains the following resource types: Chemicals & Drugs, Anatomy, Disorders,
Living Beings, Organizations, Physiology, and Genes & Molecular
Sequences. Applying this filtering to the background knowledge of
Falcon reduces the ambiguity among the resources in the EL task
and clears the noise that can be generated by irrelevant resources.
6.1 EABlock Efficiency - RQ1
To evaluate how the performance of a KG creation pipeline may
be impacted by applying EABlock, we set up 24 KG creation pipelines
in total. Experiments are grouped as Baseline or EABlock; Baseline corresponds to the pipelines in which EA is executed in a
pre-processing stage, while EABlock represents the KG creation
pipelines in which EABlock enables the specification of EA in the
RML mapping rules. Experiments are grouped into six categories,
each category utilizing a different 𝐷𝐸, i.e., all the experiments in one
category have the same 𝐷𝐸. To avoid any bias caused by the techniques applied in the development of the state-of-the-art engines,
we repeat the same experiments with two different available engines
7 https://labs.tib.eu/sdm/biofalcon/
8 https://github.com/semantifyit/RocketRML
9 https://github.com/SDM-TIB/SDM-RDFizer
of the applied RML-compliant engine, utilizing EABlock in all KG
pipelines reduces the overall execution time of the KG creation.
Figure 4 demonstrates that performing EA as pre-processing is
more expensive than using EABlock as part of the main pipeline of
KG creation. It can also be observed that, for more
complex mapping rules, the reduction in
execution time achieved by EABlock is even more pronounced.
SAC ’22, April 25–29, 2022, Virtual Event,
6.2 EABlock Effectiveness - RQ2
We define two pipelines: Baseline and EABlock; the Baseline pipeline
includes no entity alignment task. The aim is to evaluate the connectivity in a KG created using the EABlock pipeline and thereby assess
RQ2.
Datasets and Mappings. We extract data related to drugs
(11,293 records), the disorders for which the drugs are prescribed
(416 records), and the interactions between the drugs (1,646,836
records) from DrugBank¹⁰ (version 5.1.8). We produce three mock
datasets resembling normal clinical notes for cancer patients, including the data related to comorbidities (1,322 records) and
prescribed oncological (1,764 records) and non-oncological drugs
(1,325 records). We create a unified schema for these datasets and a
set 𝐷4 of RML mapping rules to integrate them. We also create
a set 𝐷5 with all the mapping rules in 𝐷4 plus the corresponding
calls to EABlock functions to execute entity alignment for drugs
and disorders.
Analysis. Let 𝐾𝐺𝑏 and 𝐾𝐺𝑒𝑎𝑏𝑙𝑜𝑐𝑘 be the KGs created by the Baseline and EABlock pipelines, respectively. 𝐾𝐺𝑒𝑎𝑏𝑙𝑜𝑐𝑘
comprises 10,339,870 RDF triples, while 𝐾𝐺𝑏 has 10,200,209. 𝐾𝐺𝑏
and 𝐾𝐺𝑒𝑎𝑏𝑙𝑜𝑐𝑘 are used to create two labelled directed graphs
𝐺𝑏 = (𝑉, 𝐸𝑏) and 𝐺𝑒𝑎𝑏𝑙𝑜𝑐𝑘 = (𝑉, 𝐸𝑒𝑎𝑏𝑙𝑜𝑐𝑘), and traditional network
analysis methods are applied to determine connectivity. The vertices
in 𝑉 are the classes in 𝐾𝐺𝑏 and 𝐾𝐺𝑒𝑎𝑏𝑙𝑜𝑐𝑘 with at least one
resource; 𝐾𝐺𝑏 and 𝐾𝐺𝑒𝑎𝑏𝑙𝑜𝑐𝑘 have the same resources and literals. A labelled directed edge 𝑒 = (𝑞, 𝑝, 𝑘) belongs to 𝐸𝑏 (resp., to
𝐸𝑒𝑎𝑏𝑙𝑜𝑐𝑘) if there are classes 𝑄 and 𝐾 in 𝑉, 𝑞 and 𝑘 are instances of 𝑄 and 𝐾 in 𝐾𝐺𝑏 (resp., 𝐾𝐺𝑒𝑎𝑏𝑙𝑜𝑐𝑘), and the RDF triple
(𝑞, 𝑝, 𝑘) belongs to 𝐾𝐺𝑏 (resp., 𝐾𝐺𝑒𝑎𝑏𝑙𝑜𝑐𝑘). 𝐺𝑏 and 𝐺𝑒𝑎𝑏𝑙𝑜𝑐𝑘 provide
an aggregated representation of 𝐾𝐺𝑏 and 𝐾𝐺𝑒𝑎𝑏𝑙𝑜𝑐𝑘. Figure 5 depicts 𝐺𝑏 and 𝐺𝑒𝑎𝑏𝑙𝑜𝑐𝑘; 𝐺𝑒𝑎𝑏𝑙𝑜𝑐𝑘 is composed of 11 vertices and 39
directed edges, while 𝐺𝑏 comprises 11 vertices connected by only
10 edges. Figure 5c compares 𝐺𝑏 and 𝐺𝑒𝑎𝑏𝑙𝑜𝑐𝑘 in terms of graph
metrics generated by Cytoscape¹¹.
Metrics.
a. Average number of neighbors indicates the average connectivity of a vertex or node in a graph.
b. Network diameter measures the length of the shortest path that connects the two most distant nodes in a graph.
c. Clustering coefficient measures the tendency of nodes that share the same connections in a graph to become connected. If a neighborhood is fully connected, the clustering coefficient is 1.0, while a value close to 0.0 means that there is no connection in the neighborhood.
d. Network density measures the portion of potential edges in a graph that are actually edges; a value close to 1.0 indicates that the graph is fully connected.
e. The number of connected components indicates the number of subgraphs composed of vertices connected by at least one path.
Results. The results reported in Figure 5c indicate, in terms of
the average number of neighbors, that 𝐾𝐺𝑒𝑎𝑏𝑙𝑜𝑐𝑘 comprises entities that are more connected. Moreover, the clustering coefficient
is relatively low, but the CUI annotations and the links to DBpedia
and Wikidata included in 𝐾𝐺𝑒𝑎𝑏𝑙𝑜𝑐𝑘 increase the connectivity in
the neighborhoods of 𝐺𝑒𝑎𝑏𝑙𝑜𝑐𝑘. In particular, eablock:Patient,
eablock:DrugDisorder, eablock:Annotation, eablock:Disease, wiki:Q12136, wiki:Q11173, and dbo:Drug have a neighborhood connectivity of 8 in 𝐺𝑒𝑎𝑏𝑙𝑜𝑐𝑘. On the other hand, in 𝐺𝑏, the
neighborhood connectivity of eablock:Annotation, eablock:Disease, wiki:Q12136, wiki:Q11173, and dbo:Drug is 0, and that of eablock:Patient and eablock:DrugDisorderInteraction is 3. These
results corroborate that connectivity is enhanced as a result of the
entity alignment implemented by the EABlock functions.
6.3 EABlock Effectiveness - RQ3
We evaluate the performance of EABlock and the impact that
data quality issues may have on entity alignment. We define 15
testbeds: five each for DBpedia, Wikidata, and UMLS.
Datasets. Gold standard: We create the gold standard datasets by extracting biomedical instances from the DBpedia (06-2010) and Wikidata
(04-2019) KGs, and from the UMLS (November 2020) dataset. We obtain
51,850 records from Wikidata considering the classes “Q12136 (disease)”, “Q3736076 (biological function)”, “Q43229 (organization)”,
“Q514 (anatomy)”, “Q79529 (chemical substance)”, and “Q863908
(nucleic acid sequence)”. From DBpedia, we extract 43,680 records
related to the classes “ChemicalSubstance”, “Disease”, “Gene”, and
“PersonFunction”. The whole UMLS dataset includes
3,741,395 records. Nonetheless, to avoid a huge difference
from the number of records in the DBpedia and
Wikidata experiments, we randomly select 1,496,557 records from the UMLS
dataset. Meanwhile, to decrease the chance of possible biases, for
each of the five experiments, we generate a new randomly selected
dataset of UMLS with the same number of records, i.e., 1,496,557¹².
Testbeds: For each of DBpedia, Wikidata, and UMLS, five testbeds
are generated by manipulating the gold standard datasets according to frequent quality issues that may exist in datasets: character
capitalization, elimination, insertion, and replacement. All testbeds
created for DBpedia, Wikidata, and UMLS are of
lower quality than the gold standard datasets due to the intentional misspelling errors that we introduce in the values of the
records. The errors include: a) capitalizing all the characters of
a record value; b) randomly eliminating one character from the
value; c) randomly replacing one character with another randomly
selected character; and d) inserting a randomly selected character
at a random location in the record value. Accordingly, each of the mentioned errors is introduced in 50% of the records of one of the
testbeds; the other 50% of the records carry the same values as the
gold standard. The last testbed, created by including all four types
of errors, has the lowest quality. In this dataset, each error type is applied to a distinct 20% of the records,
so that 80% of the records contain exactly one of the four errors. Therefore,
20% of the records are error-free, in contrast to the other four testbeds,
in which 50% of all records are free from errors.
Metrics. We assess the sensitivity of EABlock in terms of I. precision: the
fraction of correct entity alignments among all the entity
alignment results returned by EABlock; II. recall: the fraction of
correct entity alignments among all the results that EABlock is expected
to return; and III. F1 score: the harmonic mean
10 https://go.drugbank.com/
11 https://cytoscape.org/
12 https://tib.eu/cloud/s/XJiqDDAHqM8Fw5K
(a) Directed Labelled Graph 𝐺𝑏
(b) Directed Labelled Graph 𝐺𝑒𝑎𝑏𝑙𝑜𝑐𝑘
(c) Graph metrics for 𝐺𝑏 and 𝐺𝑒𝑎𝑏𝑙𝑜𝑐𝑘

Analysis                 | EABlock (𝐺𝑒𝑎𝑏𝑙𝑜𝑐𝑘) | Baseline (𝐺𝑏)
Number of nodes          | 11                | 11
Number of edges          | 39                | 10
Avg. number of neighbors | 3.091             | 1.273
Network diameter         | 2                 | 1
Clustering coefficient   | 0.184             | 0.098
Network density          | 0.155             | 0.064
Connected components     | 1                 | 6

Figure 5: Connectivity Analysis. Baseline and EABlock pipelines generate 𝐾𝐺𝑏 and 𝐾𝐺𝑒𝑎𝑏𝑙𝑜𝑐𝑘, respectively. 𝐾𝐺𝑏 and 𝐾𝐺𝑒𝑎𝑏𝑙𝑜𝑐𝑘
have the same classes and entities. However, 𝐾𝐺𝑏 does not include the entity alignments to UMLS, Wikidata, and DBpedia added
to 𝐾𝐺𝑒𝑎𝑏𝑙𝑜𝑐𝑘. 𝐺𝑏 and 𝐺𝑒𝑎𝑏𝑙𝑜𝑐𝑘 are directed labelled graphs that provide an aggregated representation of 𝐾𝐺𝑏 and 𝐾𝐺𝑒𝑎𝑏𝑙𝑜𝑐𝑘. The
values of the graph metrics corroborate that connectivity is increased by entity alignment performed by the EABlock pipeline.
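As an illustration of how metrics of the kind reported in Figure 5c can be derived from the labelled directed edges of an aggregated graph, consider the following Python sketch. It is not the Cytoscape implementation: the function `graph_metrics` and the toy `ex:` classes and edges are hypothetical. The density convention used here (distinct neighbor pairs divided by n(n-1)) is an assumption, chosen because it agrees arithmetically with the reported values: 17 and 7 distinct neighbor pairs over 11 nodes would give average neighbor counts of 34/11 ≈ 3.091 and 14/11 ≈ 1.273, and densities of 17/110 ≈ 0.155 and 7/110 ≈ 0.064.

```python
def graph_metrics(vertices, labelled_edges):
    """Compute simple connectivity metrics for a labelled directed graph.

    vertices: iterable of class names.
    labelled_edges: iterable of (source, label, target) triples whose
    endpoints must all appear in vertices.
    Neighborhoods ignore edge direction, edge labels, and self-loops.
    """
    vertices = list(vertices)
    edges = list(labelled_edges)
    neighbors = {v: set() for v in vertices}
    for q, _label, k in edges:
        if q != k:  # a self-loop does not add a neighbor
            neighbors[q].add(k)
            neighbors[k].add(q)

    n = len(vertices)
    # Each unordered neighbor pair is counted once from each endpoint.
    pairs = sum(len(s) for s in neighbors.values()) // 2

    # Count connected components with an iterative depth-first search.
    seen, components = set(), 0
    for start in vertices:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            for w in neighbors[stack.pop()]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)

    return {
        "nodes": n,
        "edges": len(edges),
        "avg_neighbors": 2 * pairs / n if n else 0.0,
        # Assumed convention: neighbor pairs over n(n-1) ordered pairs.
        "density": pairs / (n * (n - 1)) if n > 1 else 0.0,
        "components": components,
    }


# Toy graph (hypothetical classes and properties).
toy_vertices = ["ex:A", "ex:B", "ex:C", "ex:D"]
toy_edges = [
    ("ex:A", "ex:p", "ex:B"),
    ("ex:B", "ex:q", "ex:A"),
    ("ex:B", "ex:p", "ex:C"),
]
metrics = graph_metrics(toy_vertices, toy_edges)
print(metrics)
```

In the toy graph, three labelled edges yield only two distinct neighbor pairs, and the isolated vertex forms its own connected component; this mirrors how 𝐺𝑏 can have several components while 𝐺𝑒𝑎𝑏𝑙𝑜𝑐𝑘 has one.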
of precision and recall.
Results. Table 1 reports the results
of running EABlock over the five configurations of error types explained in Testbeds. The entity alignment engine (Falcon) achieves
relatively high performance. Cleaning the background knowledge
of Falcon and filtering it to contain only resources related to the
biomedical domain, as explained in section 5, reduces the ambiguity
among the resources in the background knowledge; this improves
the performance of Falcon significantly and plays a major role in
achieving such high performance. Also, providing the input of the entity
alignment module as keywords without any noise, e.g., stopwords,
helps the module recognize and link the labels precisely. Moreover,
Table 1 suggests that the entity alignment module used is able to
overcome the introduced error types. There are records for which
EABlock fails to return the linked entity expected according to the gold
standard. These failures are mostly caused by data quality issues in
the KGs. To clarify, we enumerate a couple of examples with
possible explanations: a) There are cases in which the linked entities retrieved by the entity alignment engine (Falcon) are correct,
even though their identifiers differ from those available in
the gold standard. For instance, for the keyword “malignant histiocytosis”, EABlock retrieves “Q164952”¹³ from Wikidata, while in the
gold standard the Wikidata identifier for the same keyword appears
to be “Q52962465”¹⁴. However, both identifiers lead to the same
entry on Wikidata; for some keywords, more than one identifier
exists in such KGs. b) Another example of unexpectedly retrieved
linked entities can be observed for long combinations
of keywords, such as “early infantile epileptic encephalopathy 19”.
In this case, EABlock links the first entity that can be recognized
from the first few keywords, “Q61913448”¹⁵, which belongs to
the label “early infantile epileptic encephalopathy 37”. The same
failure case can be observed in retrieval from DBpedia as well:
for the keyword “Chronic leukemia”, EABlock retrieves a link to
“B-cell_chronic_lymphocytic_leukemia”¹⁶, while based on the gold
standard the correct link is “Chronic_leukemia”¹⁷.
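The four error types described under Testbeds, and the way they are distributed over the records, can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the generator used in the experiments: the function names, the fixed seed, and the restriction of inserted or replacing characters to lowercase ASCII letters are all hypothetical choices.

```python
import random
import string


def capitalize_all(value, rng):
    """a) Capitalize all the characters of a record value."""
    return value.upper()


def eliminate_char(value, rng):
    """b) Remove one randomly chosen character."""
    if not value:
        return value
    i = rng.randrange(len(value))
    return value[:i] + value[i + 1:]


def replace_char(value, rng):
    """c) Replace one randomly chosen character with a different one."""
    if not value:
        return value
    i = rng.randrange(len(value))
    alternatives = [c for c in string.ascii_lowercase if c != value[i]]
    return value[:i] + rng.choice(alternatives) + value[i + 1:]


def insert_char(value, rng):
    """d) Insert a randomly chosen character at a random position."""
    i = rng.randrange(len(value) + 1)
    return value[:i] + rng.choice(string.ascii_lowercase) + value[i:]


ERRORS = (capitalize_all, eliminate_char, replace_char, insert_char)


def make_testbed(values, error, fraction=0.5, seed=0):
    """Apply one error type to a random fraction of the records."""
    rng = random.Random(seed)
    chosen = set(rng.sample(range(len(values)), int(len(values) * fraction)))
    return [error(v, rng) if i in chosen else v for i, v in enumerate(values)]


def make_combined_testbed(values, errors=ERRORS, share=0.2, seed=0):
    """Apply each error type to its own disjoint share of the records."""
    rng = random.Random(seed)
    order = list(range(len(values)))
    rng.shuffle(order)
    per_error = int(len(values) * share)
    result = list(values)
    for j, error in enumerate(errors):
        for i in order[j * per_error:(j + 1) * per_error]:
            result[i] = error(values[i], rng)
    return result


# Labels taken from the examples discussed in the text.
gold = ["malignant histiocytosis", "chronic leukemia",
        "early infantile epileptic encephalopathy 19", "leukemia"]
print(make_testbed(gold, insert_char))
```

With `share=0.2` and four error types, the combined testbed leaves 20% of the records untouched, matching the description of the fifth testbed.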
6.4 Discussion
As emphasized in previous sections, EABlock can be utilized in
any KG creation pipeline that applies an RML-compliant engine for
the tasks of mapping rule translation and RDF triple generation.
Although there is a W3C community group targeting the
issues regarding declarative KG creation using the R2RML and
RML mapping languages¹⁸, there is still a lack of formal definitions,
which is a general barrier slowing the adoption of different
KG creation workflows and engines. Additionally, the absence of
formal definitions hinders extending the available state
of the art. Another aspect of being agnostic concerns the
tool that performs the NER and EL tasks. Despite all the value
this independence brings, the performance of EABlock is subject to
change based on the choice of the entity alignment tool.
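The independence from a particular NER and EL tool can be pictured as a small interface behind which any tool can be placed. The Python sketch below is an assumption made for illustration and does not reproduce the EABlock implementation: `EntityAligner`, `DictionaryAligner`, and `align_column` are hypothetical names, and the dictionary-based aligner merely stands in for a real tool such as Falcon.

```python
from typing import Dict, List, Optional, Protocol


class EntityAligner(Protocol):
    """Anything that maps a keyword to a resource IRI, or to None."""

    def link(self, keyword: str) -> Optional[str]:
        ...


class DictionaryAligner:
    """Stand-in aligner backed by an in-memory label-to-IRI dictionary."""

    def __init__(self, label_to_iri: Dict[str, str]) -> None:
        self._index = {k.casefold(): v for k, v in label_to_iri.items()}

    def link(self, keyword: str) -> Optional[str]:
        return self._index.get(keyword.strip().casefold())


def align_column(values: List[str], aligner: EntityAligner) -> List[Optional[str]]:
    """Align every value, calling the aligner once per distinct value."""
    cache: Dict[str, Optional[str]] = {}
    for value in values:
        if value not in cache:
            cache[value] = aligner.link(value)
    return [cache[value] for value in values]


# The IRI is the one reported in the text for "malignant histiocytosis".
aligner = DictionaryAligner(
    {"malignant histiocytosis": "https://www.wikidata.org/wiki/Q164952"}
)
print(align_column(["Malignant histiocytosis", "unknown label"], aligner))
```

Swapping `DictionaryAligner` for another implementation of `link` changes the quality of the returned links but not the calling code, which is the trade-off noted above.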
7 CONCLUSIONS AND FUTURE WORK
The diverse interoperability issues existing in textual data and the
demand for a transparent, traceable, and efficient pipeline of
KG creation led us to introduce EABlock. EABlock is an approach
to solve entity alignment problems by capturing knowledge from
existing KGs while keeping the procedure transparent and traceable. With an eager evaluation strategy and an efficient translation
of mapping rules into function-free rules, EABlock ensures that
efficiency is not sacrificed for the sake of reproducibility. The observed
experimental results show the benefits of grounding solutions for
KG creation in well-established problems like NER, EL, and
data integration systems. Thus, EABlock broadens the repertoire of
approaches for KG creation and provides the basis for developing
real-world KGs. Our vision is that EABlock will be the starting point
13 https://www.wikidata.org/wiki/Q164952
14 http://www.wikidata.org/entity/Q52962465
15 https://www.wikidata.org/wiki/Q61913448
16 https://dbpedia.org/page/Chronic_lymphocytic_leukemia
17 https://dbpedia.org/page/Chronic_leukemia
18 https://www.w3.org/community/kg-construct/
Table 1: Effectiveness. The performance of EABlock is assessed on 15 datasets. Misspelling errors are
added with five types of transformations. Entity alignment
is performed by the keyword-based functions over the
keywords generated by the five transformations. The aligned
entities are compared with the resources of the original
labels. EABlock fails to recognize long labels generated from
keywords with non-alphanumerical characters present in
UMLS (e.g., n-((2r)-1-(((2r)-1-(((2r)-6-amino-1-(4-amino-4-carboxy-1-piperidinyl)-1-oxo-2-hexanyl)amino)-4-methyl-1-oxo-2-pentanyl)amino)-1-oxo-3-phenyl-2-propanyl)-d-phenylalaninamide).
UMLS
Text Error Type                  | Precision | Recall | F1 Score
Capitalization of all characters | 1.0       | 0.97   | 0.99
Elimination of a character       | 0.99      | 0.74   | 0.85
Replacement of a character       | 0.99      | 0.75   | 0.85
Insertion of a new character     | 0.99      | 0.50   | 0.66
Combination of all 4 errors      | 1.0       | 0.78   | 0.88

DBpedia
Text Error Type                  | Precision | Recall | F1 Score
Capitalization of all characters | 0.78      | 0.78   | 0.78
Elimination of a character       | 0.78      | 0.78   | 0.78
Replacement of a character       | 0.98      | 0.98   | 0.98
Insertion of a new character     | 0.78      | 0.78   | 0.78
Combination of all 4 errors      | 0.78      | 0.78   | 0.78

Wikidata
Text Error Type                  | Precision | Recall | F1 Score
Capitalization of all characters | 0.99      | 0.99   | 0.99
Elimination of a character       | 0.99      | 0.99   | 0.99
Replacement of a character       | 0.99      | 0.99   | 0.99
Insertion of a new character     | 0.99      | 0.99   | 0.99
Combination of all 4 errors      | 0.99      | 0.99   | 0.99
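For reference, the three metrics reported in Table 1 can be written out as a short Python sketch. The helper names and the count-based arguments (`n_correct`, `n_returned`, `n_expected`) are hypothetical. The final loop recomputes the F1 score from the two-decimal precision and recall values of the UMLS rows; because these inputs are already rounded, agreement with the reported F1 scores is only checked up to a tolerance of 0.01.

```python
def precision(n_correct, n_returned):
    """Fraction of returned alignments that are correct."""
    return n_correct / n_returned if n_returned else 0.0


def recall(n_correct, n_expected):
    """Fraction of expected alignments that were correctly returned."""
    return n_correct / n_expected if n_expected else 0.0


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# UMLS rows of Table 1: (precision, recall, reported F1 score).
umls_rows = {
    "Capitalization of all characters": (1.0, 0.97, 0.99),
    "Elimination of a character": (0.99, 0.74, 0.85),
    "Replacement of a character": (0.99, 0.75, 0.85),
    "Insertion of a new character": (0.99, 0.50, 0.66),
    "Combination of all 4 errors": (1.0, 0.78, 0.88),
}

for error_type, (p, r, reported) in umls_rows.items():
    computed = f1_score(p, r)
    assert abs(computed - reported) < 0.01
    print(f"{error_type}: computed F1 = {computed:.3f}, reported = {reported}")
```

The insertion row shows how the harmonic mean penalizes an imbalance: with precision 0.99 and recall 0.50, the F1 score stays close to 0.66 rather than the arithmetic mean of about 0.75.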
for the development of FnO functions which can be made available following the FAIR principles. In the future, we will empower
EABlock with a set of functions to overcome interoperability
problems across multilingual and unstructured datasets.
Acknowledgments. This work has been partially supported by
the EU H2020 RIA funded projects: CLARIFY with grant agreement
(GA) No. 875160, P4-LUCAT with GA No. 53000015, and PLATOON
with GA No. 872592.
REFERENCES
[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. 2007. DBpedia:
A Nucleus for a Web of Open Data. In Proceedings of ISWC + ASWC.
[2] O. Bodenreider. 2004. The Unified Medical Language System (UMLS): integrating
biomedical terminology. Nucleic Acids Res 1, 32 (2004).
[3] Cinzia Cappiello, Avigdor Gal, Matthias Jarke, and Jakob Rehof. 2020. Data ecosystems: sovereign data exchange among organizations (Dagstuhl Seminar 19391).
In Dagstuhl Reports, Vol. 9. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
[4] David Chaves-Fraga, Kemele M Endris, Enrique Iglesias, Oscar Corcho, and
Maria-Esther Vidal. 2019. What are the Parameters that Affect the Construction
of a Knowledge Graph?. In OTM Conferences.
[5] Souripriya Das, Seema Sundara, and Richard Cyganiak. 2012. R2RML: RDB to
RDF Mapping Language, W3C Recommendation 27 September 2012. W3C (2012).
[6] Ben De Meester, Anastasia Dimou, Ruben Verborgh, and Erik Mannens. 2016. An
ontology to semantically declare and describe functions. In European Semantic
Web Conference. Springer, 46–49.
[7] Ben De Meester, Wouter Maroy, Anastasia Dimou, Ruben Verborgh, and Erik
Mannens. 2017. Declarative data transformations for Linked Data generation:
the case of DBpedia. In European Semantic Web Conference.
[8] Christophe Debruyne and Declan O’Sullivan. 2016. R2RML-F: Towards Sharing
and Executing Domain Logic in R2RML Mappings. In LDOW Workshop.
[9] Anastasia Dimou, Gerald Haesendonck, Martin Vanbrabant, Laurens De Vocht,
Ruben Verborgh, Steven Latré, and Erik Mannens. 2017. ILastic: Linked Data
Generation Workflow and User Interface for iMinds Scholarly Data. In Semantics,
Analytics, Visualization. Springer, 15–32.
[10] Anastasia Dimou, Miel Vander Sande, Pieter Colpaert, Ruben Verborgh, Erik
Mannens, and Rik Van de Walle. 2014. RML: a generic language for integrated
RDF mappings of heterogeneous data. In LDOW.
[11] Sandra Geisler, Maria-Esther Vidal, Cinzia Cappiello, Bernadette Farias Lóscio,
Avigdor Gal, Matthias Jarke, Maurizio Lenzerini, Paolo Missier, Boris Otto, Elda
Paja, Barbara Pernici, and Jakob Rehof. 2021. Knowledge-driven Data Ecosystems
Towards Data Transparency. Accepted at the Special Issue of Data Transparency,
ACM Journal Data and Information Quality (2021).
[12] Claudio Gutiérrez and Juan F. Sequeda. 2021. Knowledge graphs. Commun. ACM
64, 3 (2021), 96–104.
[13] Enrique Iglesias, Samaneh Jozashoori, David Chaves-Fraga, Diego Collarana,
and Maria-Esther Vidal. 2020. SDM-RDFizer: An RML Interpreter for the Efficient Creation of RDF Knowledge Graphs. In ACM International Conference on
Information and Knowledge Management CIKM.
[14] Ernesto Jiménez-Ruiz, Oktie Hassanzadeh, Vasilis Efthymiou, Jiaoyan Chen,
Kavitha Srinivas, and Vincenzo Cutrona. 2020. Results of SemTab 2020. In CEUR
Workshop Proceedings, Vol. 2775. 1–8.
[15] Samaneh Jozashoori, David Chaves-Fraga, Enrique Iglesias, Maria-Esther Vidal,
and Oscar Corcho. 2020. FunMap: Efficient Execution of Functional Mappings
for Knowledge Graph Creation. In International Semantic Web Conference.
[16] Ademar Crotti Junior, Christophe Debruyne, Rob Brennan, and Declan O’Sullivan.
2016. FunUL: a method to incorporate functions into uplift mapping languages. In
Intern. Confer. on Information Integration and Web-based Applications and Services.
[17] Maurizio Lenzerini. 2002. Data integration: A theoretical perspective. In ACM
Symposium on Principles of Database Systems.
[18] Franck Michel, Fabien Gandon, Valentin Ah-Kane, Anna Bobasheva, Elena Cabrio,
Olivier Corby, Raphaël Gazzotti, Alain Giboin, Santiago Marro, Tobias Mayer, et al.
2020. Covid-on-the-Web: Knowledge graph and services to advance COVID-19
research. In International Semantic Web Conference. Springer, 294–310.
[19] Ahmad Sakor, Isaiah Onando Mulang, Kuldeep Singh, Saeedeh Shekarpour, Maria-Esther Vidal, Jens Lehmann, and Sören Auer. 2019. Old is Gold: Linguistic Driven
Approach for Entity and Relation Linking of Short Text. In Conference of the
North American Chapter of the Association for Computational Linguistics: Human
Language Technologies, NAACL-HLT.
[20] Ahmad Sakor, Kuldeep Singh, Anery Patel, and Maria-Esther Vidal. 2020. Falcon 2.0: An Entity and Relation Linking Tool over Wikidata. In International
Conference on Information and Knowledge Management (CIKM).
[21] Maria-Esther Vidal, Kemele M Endris, Samaneh Jozashoori, Ahmad Sakor, and
Ariam Rivas. 2019. Transforming heterogeneous data into knowledge for personalized treatments - A use case. Datenbank-Spektrum 19, 2 (2019).
[22] D. Vrandecic and M. Krötzsch. 2014. Wikidata: a free collaborative knowledgebase.
Commun. ACM 57, 10 (2014), 78–85.
[23] Binh Vu, Jay Pujara, and Craig A Knoblock. 2019. D-REPR: A Language for
Describing and Mapping Diversely-Structured Data Sources to RDF. In Intern.
Confer. on Knowledge Capture.
[24] Mark D Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten, Luiz Bonino
da Silva Santos, Philip E Bourne, et al. 2016. The FAIR Guiding Principles for
scientific data management and stewardship. Scientific data 3, 1 (2016), 1–9.
[25] Kaisheng Zeng, Chengjiang Li, Lei Hou, Juanzi Li, and Ling Feng. [n.d.]. A
comprehensive survey of entity alignment for knowledge graphs. AI Open 2021
([n. d.]).