Proceedings of the 37th ACM/SIGAPP Symposium on Applied Computing

EABlock: A Declarative Entity Alignment Block for Knowledge Graph Creation Pipelines

Samaneh Jozashoori (TIB – Leibniz Information Centre for Science and Technology; Leibniz University of Hannover), samaneh.jozashoori@tib.eu
Ahmad Sakor (TIB – Leibniz Information Centre for Science and Technology; Leibniz University of Hannover), ahmad.sakor@tib.eu
Enrique Iglesias (L3S Research Center; Leibniz University of Hannover), iglesias@l3s.de
Maria-Esther Vidal (TIB – Leibniz Information Centre for Science and Technology; Leibniz University of Hannover; L3S Research Center), maria.vidal@tib.eu

ABSTRACT
Despite encoding enormous amounts of rich and valuable data, existing data sources are mostly created independently, which makes their integration a significant challenge. Mapping languages, e.g., RML and R2RML, facilitate the declarative specification of the process of applying meta-data and integrating data into a knowledge graph. Mapping rules can also include knowledge extraction functions in addition to expressing correspondences among data sources and a unified schema. Combining mapping rules and functions represents a powerful formalism to specify pipelines for integrating data into a knowledge graph transparently. Surprisingly, these formalisms are not fully adopted, and many knowledge graphs are created by executing ad-hoc programs to pre-process and integrate data. In this paper, we present EABlock, an approach integrating Entity Alignment (EA) as part of RML mapping rules. EABlock includes a block of functions that perform entity recognition over textual attributes and link the recognized entities to the corresponding resources in Wikidata, DBpedia, and domain-specific thesauri, e.g., UMLS. EABlock provides agnostic and efficient techniques to evaluate the functions and transfer the mappings to facilitate its application in any RML-compliant engine. We have empirically evaluated the EABlock performance, and the results indicate that EABlock speeds up knowledge graph creation pipelines that require entity recognition and linking in state-of-the-art RML-compliant engines. EABlock is also publicly available as a tool through a GitHub repository and a DOI.

CCS CONCEPTS
• Information systems → Resource Description Framework (RDF); Information extraction.

KEYWORDS
Knowledge Graph Creation; Semantic Data Integration; Entity Alignment; Mapping Rules; Functional Mappings

ACM Reference Format:
Samaneh Jozashoori, Ahmad Sakor, Enrique Iglesias, and Maria-Esther Vidal. 2022. EABlock: A Declarative Entity Alignment Block for Knowledge Graph Creation Pipelines. In The 37th ACM/SIGAPP Symposium on Applied Computing (SAC '22), April 25–29, 2022, Virtual Event. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3477314.3507132

1 INTRODUCTION
Knowledge graphs (KGs) represent the convergence among data and knowledge using networks. Although the term was coined by the research community several decades ago, KGs are playing an increasingly relevant role in scientific and industrial areas [12]. Years of research on semantic data management and knowledge engineering have paved the way for the integration of factual statements spread across various data sources or collected from community-maintained data sources (e.g., Wikidata [22] and DBpedia [1]). The rich spectrum of knowledge represented in existing KGs positions them as sources of background knowledge to empower data-driven processes. Nevertheless, real-world applications require accountable methods to facilitate the traceability of the data management processes performed to integrate data into a KG.
Thus, KG management needs to be enriched with transparent methods to understand and validate the steps performed to transform disparate data into a unified KG. Data integration systems (DIS) [17] represent generic frameworks to define a KG in terms of a unified schema, a set of data sources, and mapping rules between concepts in the unified schema and the sources. The declarative definition of mapping languages represents a building block for tracing a KG creation process; it also facilitates reusability and modularity. Mapping languages (e.g., R2RML [5] and RML [10]) have been proposed as standards describing correspondences between the concepts in the unified schema (e.g., classes, properties, and relations) and the data sources' attributes. Thus, by following the global as view (GAV) paradigm [17], where concepts in the unified schema are defined in terms of the sources, they enable the resolution of interoperability conflicts among data sources defined using different schemas. However, data sources may have diverse levels of structuredness (e.g., structured, semi-structured, and unstructured), suffer from data quality issues, or present several interpretations of the same real-world entity. The resolution of these conflicts as part of the process of KG creation can be defined as Data Operators in a Data Ecosystem (DE), as proposed by Cappiello et al. [3]. Alternatively, mapping languages have been extended to embrace Data Operators as functions that can be included as programming scripts directly in the mapping rules [8, 16, 23] or can follow a declarative approach (e.g., using the Function Ontology, FnO) [6, 7]. They offer clear benefits in comparison to ad-hoc pre- and post-processing techniques in terms of reusability and reproducibility. Nonetheless, the lack of generic frameworks to deal with mapping rules and functions complicates mapping rule design because these functions also need to be implemented.

Our Method: We address the problem of EA, using target knowledge to solve interoperability conflicts across data sources, by proposing a method named EABlock. EABlock is a computational block composed of a set of FnO functions, which can be called from RML mapping rules, and an efficient strategy to evaluate them. The functions in EABlock are tuned to effectively align entities in a KG with their corresponding entities in existing KGs (e.g., DBpedia [1] and Wikidata [22]) and controlled vocabularies (e.g., UMLS [2]).
These functions resort to another engine for solving the tasks of named entity recognition (NER) and entity linking (EL) required for EA; any engine performing NER and EL tasks can be utilized. EABlock follows an eager evaluation strategy and enables the execution of the EABlock functions before the RML mapping rules are executed. This evaluation strategy, defined by Jozashoori et al. [15], facilitates the transformation of RML with the EABlock functions into function-free RML mapping rules that can be executed by any RML-compliant engine without requiring any modification in the engine. EABlock has been developed and experimentally evaluated on real-world datasets collected from DBpedia, Wikidata, and UMLS. The observed outcomes suggest that the EABlock functions perform EA to domain-specific and encyclopedic KGs effectively. EABlock is utilized in three international projects to integrate data into the KGs developed in these projects. The results corroborate the role that declaratively defined functions have in KG management.

This paper is structured in six additional sections. Section 2 summarizes the state of the art, and section 3 motivates and defines the problem addressed by EABlock. While section 4 provides an overview of background knowledge and preliminary concepts, section 5 formally defines the problem and describes EABlock as the solution, including its proposed strategy and techniques. In section 6, the results of the experimental study are reported and explained. Lastly, section 7 wraps up and outlines the future work.

2 RELATED WORK
Entity Alignment (EA) is an important solution to overcome interoperability issues while creating a knowledge graph from heterogeneous data sources. Dimou et al. [9], Michel et al. [18], and Vidal et al. [21] propose EA as a pre-processing step, prior to the semantic enrichment and integration of data. In this case, pre-processing performs the task of EA on the whole provided data sources, independent of their involvement in the goal KG. Hence, including EA as part of pre-processing can add a considerable overhead to the knowledge graph creation pipeline. Additionally, pre-processing steps are usually developed as ad-hoc programs, which are neither declarative nor easy to maintain. SemTab (https://www.cs.ox.ac.uk/isg/challenges/sem-tab/) is an effort in benchmarking systems dealing with the tabular data to KG matching problem and presenting the existing challenges [14]. An alternative is to perform EA after the creation of the KG, at the expense of creating the same nodes multiple times across different KGs. Zeng [25] provides a comprehensive survey of available techniques to add EA in post-processing, i.e., to find the equivalent entities in different created KGs. Lastly, EA can be part of the main pipeline of semantic data integration and knowledge graph creation by applying transformation functions. In other words, EA can be involved in the mapping rules that enrich raw data semantically and transform them into the RDF model. In this case, EA needs to be defined as a transformation function in the mapping rules. There are different mapping languages enabling the involvement of functions as part of the mapping rules, such as RML+FnO [6, 7], R2RML-F [8], FunUL [16], and D-REPR [23]. There also exist different engines capable of processing functions in different languages. For instance, FunMap [15] efficiently translates function-based mappings in RML+FnO into equivalent function-free mappings in RML. In spite of all the value that declarative mapping languages and the corresponding techniques provide, their potential applications in the task of EA are neither well explored nor appreciated. Hence, we aim to fill this gap by enabling and facilitating the application of EA tools as part of mapping rules using transformation functions.

3 MOTIVATING EXAMPLE
We motivate our work with a mock example from a real-world scenario illustrated in Figure 1. In this scenario, the aim is to integrate four datasets obtained from different sources into a KG.
The datasets consist of a) patient data extracted from two different clinical notes provided by a general practitioner (GP) and an oncologist, including the comorbidities from which the patient is suffering, and b) drug-related data extracted from DrugBank (https://go.drugbank.com/), including drug-drug interaction data providing information on the possible interactions between different drugs and the impact on the effectiveness of each, and drug-disorder data revealing the list of drugs that can be prescribed for each disorder. A portion of the KG created by a naive approach can be observed in Figure 1. A closer look reveals that the same disorder instance exists as three separated nodes in the graph, i.e., there is an interoperability conflict among them. The existing interoperability issue can be traced back to the raw data, where I. the same disorder is represented with different names by the clinicians, and II. the name of the disorder is misspelled in one of the records. Another important point regards the connection between the instances of the generated KG and the instances in available domain-specific sources (e.g., UMLS) or encyclopedic KGs (e.g., DBpedia and Wikidata) which represent the same real-world entities. More specifically, the importance of the mentioned connections appears while integrating or linking other available data/knowledge bases, which are annotated with instances of such sources (i.e., UMLS, DBpedia, and Wikidata). Both observations emphasize the importance of including EA as a module in the pipeline of KG creation. It should be noted that, following the FAIR principles [24], transparency and reproducibility are essential requirements in pipelines of KG creation. All blocks applied as part of the main process or the pre- or post-processing of KG creation should be transparent and traceable. This leads to thinking about an independent transparent module for entity alignment, using a declarative language, that can be integrated in any KG creation pipeline that is compliant with the same mapping language.

Figure 1: Motivating Example. Data integration from four datasets. Different interoperability issues: the concept disorder is modeled differently in Datasets 1 and 2. The entity hypertension is represented with various entities, and its name is misspelled in Dataset 1. Performing EA with UMLS, DBpedia, or Wikidata enables conflict resolution and integration into a KG.
4 PRELIMINARIES
Knowledge graphs (KGs) are data structures that represent factual knowledge as entities and their relationships using a graph data model [12]. A KG is a directed graph G = (O, V, E), where: O is a unified schema that comprises classes, properties, and relations; V is a set of nodes in the KG, where nodes in V correspond to classes or instances of classes in O; and E is a set of directed labeled edges in the KG that relate nodes in V. Edges are labeled with properties and relations in O.

A KG creation process can be specified in terms of a Data Ecosystem (DE). A DE [3] is defined as a 4-tuple DE = ⟨DataSets, DataOperators, Meta-Data, Mappings⟩, where DataOperators represent a set of operators that can be executed over the data in DataSets, a set of structured or unstructured data sets. Meta-Data describes the domain of knowledge and the meaning of the data residing in DataSets. Meta-Data comprises: I. ontologies and controlled vocabularies that provide a unified view of the domain knowledge; II. properties that describe the data quality, provenance, and access regulations; and III. descriptions of the main characteristics of the data. Finally, Mappings represent the correspondences among the concepts and properties in different domain ontologies or the associations between the data in DataSets and the domain ontology.

The same real-world entity can be represented differently in the data sources in DataSets. Interoperability issues include: Structuredness: this conflict occurs whenever data sources are described at different levels of structuredness, e.g., structured, semi-structured, and unstructured. Schematic: this interoperability conflict exists among data sources that are modeled using different schemas, e.g., different attributes representing the same concept. Domain: this interoperability conflict occurs among various interpretations of the same entity. They include: i) homonyms: the same name is used to represent concepts with different meanings, and ii) synonyms: distinct names are used to model the same concept. Figure 1 illustrates the interoperability issues: structuredness between the two data sources of drug-drug interactions; schematic among the attributes of Datasets 1 and 2; and domain among the names representing hypertension. In general, KG creation pipelines include an additional pre- / post-processing block to solve the interoperability issues between data sources. However, this block can be part of the DE as DataOperators utilizing the knowledge encoded in the Meta-Data [11].

KGs are expressed in the Resource Description Framework (RDF), where nodes can be resources or literals, and edges correspond to predicates. RDF resources are identified by IRIs (Internationalized Resource Identifiers) or blank nodes (anonymous resources or existential variables), while literals correspond to instances of a data type (e.g., numbers, strings, or dates). Mapping rules in a DE are declaratively defined using the RDF Mapping Language (RML), an extension of the W3C-standard mapping language R2RML. RML allows for the definition of sources in different formats (e.g., CSV, relational, JSON, and XML). An RML mapping rule, named TriplesMap, follows the global as view paradigm [17], i.e., concepts in the unified schema are defined in terms of a data source. Figure 2 presents RML TriplesMaps. A rr:subjectMap defines the resources of an RDF class in the unified schema, while a set of predicate-object maps (rr:predicateObjectMap) defines the properties and relations of a class. The values of a predicate-object map can be defined in terms of a data source attribute, or as a reference to or a join with the rr:subjectMap in another TriplesMap. A reference to another triples map is denoted as rr:RefObjectMap; it can be stated only between triples maps defined over the same data source. Lastly, a rr:JoinCondition represents references between TriplesMaps defined on different data sources. A function can define a rr:subjectMap or a rr:predicateObjectMap. The Function Ontology (FnO) is used to specify functions of the type FunctionMap [6, 7]. RML TriplesMaps are executed with three operators [13]: i) Simple Object Map (SOM): the basic operator that executes a predicate-object map against a data source's attributes; ii) Object Reference Map (ORM): this operator evaluates a predicate-object map between two triples maps defined over the same source, where the predicate object corresponds to an object of the referred TriplesMap; and iii) Object Join Map (OJM): a join condition between two RML TriplesMaps with different data sources is executed with this operator.
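To make the three operators concrete, the following minimal sketch evaluates a SOM and an OJM over two in-memory CSV sources. It is not the implementation of any specific RML engine; the sources, attribute names, namespace, and predicates are hypothetical.

```python
# Minimal, engine-agnostic sketch of a Simple Object Map (SOM) and an Object Join Map (OJM).
import csv
from io import StringIO

drugs_csv = StringIO("drugName,disorder\nAspirin,hypertension\nTamoxifen,breast cancer\n")
annotations_csv = StringIO("label,cui\nhypertension,C0020538\nbreast cancer,C0006142\n")

EX = "http://example.org/"   # hypothetical namespace
triples = []

# SOM: a predicate-object map evaluated directly against an attribute of the source.
drug_rows = list(csv.DictReader(drugs_csv))
for row in drug_rows:
    subject = f"{EX}drug/{row['drugName']}"
    triples.append((subject, f"{EX}treats", row["disorder"]))

# OJM: a join condition between two TriplesMaps defined over different sources
# (child attribute 'disorder' equals parent attribute 'label'); the object is the
# subject generated by the parent TriplesMap.
parent_subjects = {row["label"]: f"{EX}annotation/{row['cui']}"
                   for row in csv.DictReader(annotations_csv)}
for row in drug_rows:
    parent = parent_subjects.get(row["disorder"])
    if parent is not None:
        triples.append((f"{EX}drug/{row['drugName']}", f"{EX}hasAnnotation", parent))

for triple in triples:
    print(triple)
```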
A DE whose Mappings comprise RML TriplesMaps with FnO functions can be executed following two strategies. a) Lazy evaluation delays the execution of a function until the moment it is needed to compute a value in a TriplesMap. b) Eager evaluation executes the functions in Mappings over the data sources before these values are needed in the RML triples maps. Lazy evaluation requires an understanding of both RML and FnO. In contrast, eager evaluation enables the transformation of the RML+FnO TriplesMaps into function-free RML triples maps. This evaluation can be done beforehand, and the results can be represented as sources of the translated function-free TriplesMaps. Another advantage of eager evaluation is that any RML-compliant engine can be used to execute the function-free TriplesMaps and create a KG.

Figure 2: RML+FnO Triples Maps. a) Drug and DBpedia-TriplesMap are RML triples maps (lines 1-23), while DBpedia-Function is an FnO function (lines 25-35). b) Eager evaluation of FnO functions creates PROJECT1.csv and PROJECT2.csv and generates function-free RML mappings, which can be executed by any RML-compliant engine without requiring a function configuration.

Figure 2 a) presents two RML TriplesMaps (lines 1-23); the function DBpedia-Function is defined in lines 25-35. Following a lazy evaluation, DBpedia-Function is executed each time a new entity of the class Drug is created. This execution requires that the RML engine be able to execute functions. Moreover, in the presence of a large number of duplicates in the data source (i.e., drug.csv), it may be executed several times. On the other hand, Figure 2 b) depicts the translation performed for eager evaluation; this approach is described by Jozashoori et al. [15]. The transformed RML TriplesMaps are evaluated over new data sources. The data source PROJECT1.csv is created from drug.csv following well-known properties of the relational algebra (e.g., pushing down the projections and selections into the data sources); in addition to enabling the reduction of the size of the data sources, these properties also eliminate duplicates. Furthermore, PROJECT2.csv is created from the materialization of DBpedia-Function. The reference between the two TriplesMaps is expressed using a join condition.
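The following sketch illustrates this eager materialization step under the assumption that the FnO function maps a drug label to a DBpedia IRI. The dbpedia_lookup function is only a stand-in for the materialized function (EABlock delegates it to an entity alignment engine), and the file names simply mirror Figure 2 b); the real FunMap/EABlock implementations differ.

```python
# Sketch of the eager (FunMap-style) materialization: project and deduplicate the input
# attribute (PROJECT1.csv) and materialize the function results (PROJECT2.csv), so the
# translated, function-free RML rules only need a join condition between the two files.
import csv

def dbpedia_lookup(label: str) -> str:
    # Stand-in for the materialized FnO function; the real value comes from an EA engine.
    return "http://dbpedia.org/resource/" + label.strip().replace(" ", "_")

def eager_materialization(source_path: str, attribute: str) -> None:
    with open(source_path, newline="", encoding="utf-8") as source:
        values = sorted({row[attribute] for row in csv.DictReader(source)})  # projection + duplicate removal

    with open("PROJECT1.csv", "w", newline="", encoding="utf-8") as out1:
        writer = csv.writer(out1)
        writer.writerow([attribute])
        writer.writerows([value] for value in values)

    with open("PROJECT2.csv", "w", newline="", encoding="utf-8") as out2:
        writer = csv.writer(out2)
        writer.writerow(["attr1", "attr2"])          # input value and function result
        writer.writerows([value, dbpedia_lookup(value)] for value in values)

# eager_materialization("drug.csv", "drugName")      # hypothetical call, mirroring Figure 2
```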
5 OUR APPROACH: EABLOCK
Problem Statement: As shown in Figure 1, a KG can comprise entities that correspond to the same real-world entity (e.g., various entities representing hypertension). We address the problem of efficiently aligning the entities in a KG G1 = (O1, V1, E1) with the entities in an existing KG G2 = (O2, V2, E2). Encyclopedic KGs like DBpedia [1] or Wikidata [22], or domain-specific ones (e.g., UMLS [2]), correspond to the KGs G2 against which the alignment is performed.

Proposed Solution: Entity alignment from G1 to G2, γ(G1 | G2), is defined in terms of an ideal KG, G* = (O*, V*, E*), that includes the nodes and edges in G1 and G2 plus all the edges that relate nodes in G1 with nodes in G2. A solution to γ(G1 | G2) corresponds to a maximal partial function ζ: V1 → V2 such that γ(G1 | G2, ζ) = {(s1, sameAs, ζ(s1)) | (s1, sameAs, ζ(s1)) ∈ E*}. (A partial function ζ: V1 → V2 is a function from a subset of V1; ζ is maximal in the partially ordered set of all such functions from V1 to V2.) The DE DE_{G1,2} = ⟨DataSets_1, DataOperators, Meta-Data_1, Mappings_{1,2}⟩ defines the KG G_{1,2} = (O1 ∪ {sameAs}, V1 ∪ V2, E1 ∪ γ(G1 | G2, ζ)). The set Mappings_{1,2} is a superset of Mappings_1 that includes all the triples maps that define ζ and enable the computation of γ(G1 | G2, ζ).

EABlock is an approach proposing a computational block to solve entity alignment over textual attributes, providing techniques that bridge and utilize all the components of a DE, i.e., DataSets, DataOperators, Meta-Data, and Mappings: a) EABlock links entities encoded in labels and short text to controlled vocabularies described by meta-data and to resources in encyclopedic and other domain-specific KGs. For this purpose, EABlock introduces a set of operating functions resorting to an entity and relation linking tool. b) The EABlock functions are defined in a human- and machine-readable medium, meeting the requirements of meta-data in terms of transparency and reusability. Although the outcome of EABlock, representing the aligned entities and annotations, provides meta-data for the KG, the addition of the EABlock functions to the meta-data of the DE equips this layer for further reproduction or maintenance of the KG with newly added data. c) The EABlock functions can be easily integrated into the mappings expressing the relations among the data and the ontology using the RML language, applying available extensions of the language. d) EABlock also provides an efficient evaluation strategy to materialize the calls of the functions in the mappings, extending the data sources and transforming the mappings into function-free RML mappings that can be adopted by any RML-compliant KG creation pipeline.

Figure 3: The EABlock components. A set of FnO functions that resort to an Entity Alignment engine. An Interpreter that executes the EABlock functions included in RML+FnO mapping rules and translates these rules into function-free rules.

As shown in Figure 3, EABlock comprises three components:

Functions includes the signatures of the EABlock functions in FnO. The functions can be divided into two categories based on their domains and ranges. Keyword-based functions receive case-insensitive keywords as input and generate one entity as the output, while Short text-based functions accept a case-insensitive short text as input and output a list of entities.

Entity Alignment performs the NER and EL tasks. This component is agnostic, i.e., any tool solving the tasks of Named Entity Recognition (NER) and Entity Linking (EL) through an API can be employed; as a proof of concept, we use Falcon 2.0 [20]. Falcon [19, 20] is empowered with background knowledge that allows for the accurate recognition and linking of biomedical concepts. Falcon 2.0 relies on a background knowledge built from resources and their corresponding labels from diverse KGs (e.g., DBpedia, Wikidata, and UMLS). The labels in the background knowledge are the textual descriptions of the resources, which are connected using the owl:sameAs relation. The background knowledge utilized for the Falcon API in EABlock is a subset of the one described in [20]. The background knowledge is filtered by omitting all the resources that are not related to the biomedical domain. The list that is utilized in the filtering process contains the following resource types: Chemicals & Drugs, Anatomy, Disorders, Living Beings, Organizations, Physiology, and Genes & Molecular Sequences. Applying this filtering to the background knowledge of Falcon reduces the ambiguity among the resources in the EL task and clears the noise that can be generated from unrelated resource types, e.g., street names.

Interpreter connects the previous two components. It follows an eager evaluation strategy of the functions and retrieves the results of the entity alignment generated by the entity alignment tool. The eager evaluation strategy gives the basis for an efficient and RML engine-agnostic execution of the EABlock functions. It resorts to the approach proposed by Jozashoori et al. [15] to translate the input RML+FnO TriplesMaps into function-free RML TriplesMaps. As explained before, EABlock creates a new dataset, the output dataset, materializing the functions. The output dataset comprises two attributes: an input attribute (attr1) and an output attribute (attr2). Depending on the category of the function, EABlock deploys one of the following two techniques. a. If the function is a Keyword-based function, for each input value, one record is added to the output dataset, including the input value and the retrieved linked entity as the values of attr1 and attr2, respectively. b. If the function is Short text-based, after evaluating the function and receiving the list of linked entities, EABlock generates the output dataset with one record for each entity in the list of linked entities, i.e., for each entity in the list of retrieved linked entities, one record is added to the output dataset which includes the input value and the linked entity as the values of attr1 and attr2, respectively. In this way, EABlock ensures that the generated datasets can be translated by any RML-compliant engine and result in exactly the same RDF triples, since different RML engines may have different interpretations of an RDF list.
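A minimal sketch of these two materialization techniques is shown below; the align callables stand in for the calls to the entity alignment engine and are assumptions, not part of the EABlock API.

```python
# Sketch of the two output-dataset materialization techniques described above.
import csv
from typing import Callable, Iterable, List

def materialize_keyword_function(values: Iterable[str],
                                 align: Callable[[str], str],
                                 path: str) -> None:
    # Keyword-based: one record per input value (attr1 = input, attr2 = linked entity).
    with open(path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["attr1", "attr2"])
        for value in values:
            writer.writerow([value, align(value)])

def materialize_short_text_function(values: Iterable[str],
                                    align: Callable[[str], List[str]],
                                    path: str) -> None:
    # Short text-based: one record per linked entity, so every RML-compliant engine
    # generates the same triples without having to interpret an RDF list.
    with open(path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["attr1", "attr2"])
        for value in values:
            for entity in align(value):
                writer.writerow([value, entity])
```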
Implementation and Application. The EABlock approach is implemented and available as a tool. As a proof of concept, EABlock integrates the Falcon 2.0 API (https://labs.tib.eu/sdm/falconmedical/falcon2/) to perform the NER and EL tasks. Falcon [19, 20] is empowered with background knowledge that allows for the accurate recognition and linking of biomedical concepts. EABlock is developed in Python 3, open source, and licensed under the Apache License 2.0. EABlock is publicly accessible through a GitHub repository (https://github.com/SDM-TIB/EABlock) and Zenodo (https://doi.org/10.5281/zenodo.5779773).
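Because the Entity Alignment component only assumes some NER/EL service reachable through an API, integrating it can be as small as an HTTP call. The request and response shapes below are illustrative assumptions, not the documented Falcon 2.0 contract; only the endpoint URL comes from the paper, and the call must be adapted to the service actually deployed.

```python
# Hedged sketch of calling an NER/EL service over HTTP; payload and response handling
# are assumptions and must be adapted to the concrete API (e.g., Falcon 2.0).
import json
import urllib.request

FALCON_ENDPOINT = "https://labs.tib.eu/sdm/falconmedical/falcon2/"   # from the paper

def link_text(text: str, endpoint: str = FALCON_ENDPOINT) -> dict:
    payload = json.dumps({"text": text}).encode("utf-8")              # assumed request shape
    request = urllib.request.Request(
        endpoint, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))            # assumed: entities and their KG IRIs

# Example (hypothetical): link_text("patient suffers from hypertension and breast cancer")
```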
6 EMPIRICAL EVALUATION
Our goal is to empirically assess the performance of EABlock in the resolution of the problem presented in section 5. The following research questions guide our experimental study: RQ1) What is the impact of applying EABlock in KG creation in terms of execution time? RQ2) How does applying EABlock in the process of KG creation impact the quality of the resulting KG? RQ3) How sensitive is EABlock to the quality of the input data? As a proof of concept, we set up the experiments using biomedical data. Accordingly, we rely on an API of Falcon (https://labs.tib.eu/sdm/biofalcon/) that provides a filtered subset of the background knowledge [20], omitting the resources that are not related to the biomedical domain. A list of related resource types is utilized for filtering the background knowledge; the list contains the following resource types: Chemicals & Drugs, Anatomy, Disorders, Living Beings, Organizations, Physiology, and Genes & Molecular Sequences. Applying this filtering to the background knowledge of Falcon reduces the ambiguity among the resources in the EL task and clears the noise that can be generated by irrelevant resources.

6.1 EABlock Efficiency - RQ1
To evaluate how the performance of a KG creation pipeline may be impacted by applying EABlock, we set up 24 KG creation pipelines in total. Experiments are grouped as Baseline or EABlock; Baseline corresponds to the pipelines where the execution of EA is performed in a pre-processing stage, while EABlock represents the KG creation pipelines in which EABlock enables the specification of EA in the RML mapping rules. Experiments are also grouped into six categories, each category utilizing a different DE, i.e., all the experiments in one category have the same DE. To avoid any bias caused by the techniques applied in the development of the state-of-the-art engines, we repeat the same experiments with two different available engines, RocketRML (https://github.com/semantifyit/RocketRML) and SDM-RDFizer (https://github.com/SDM-TIB/SDM-RDFizer). Accordingly, the experiments in one category differ in a. the applied RML-compliant engine and b. whether EABlock is used as part of the pipeline or not.

Datasets and Mappings. Considering the parameters that affect the performance of KG creation pipelines [4], we define three different sets of mapping rules, which are distinguished based on the complexity of the rr:TriplesMaps that refer to the EABlock transformation functions. We manipulate the complexity of the mentioned rules by varying the number of rr:RefObjectMaps, i.e., zero, one, or two rr:RefObjectMaps (referred to as noROM, 1ROM, and 2ROM, respectively, in Figure 4). In an attempt to prevent possible effects of data volume on the results of the experiments, we generate two relatively small datasets including 1,000 and 2,000 randomly selected records. Each dataset comprises 22 attributes, two of which are referenced in the mapping rules.

Setups. We define two KG creation pipelines, Baseline and EABlock, which execute the same entity alignment tasks and produce the same KG. The Baseline pipeline evaluates RML mapping rules while the entity alignment is performed in a pre-processing step. In contrast, the EABlock pipeline encapsulates these tasks in the EABlock functions that are called in the RML mapping rules.

Metrics. Execution time: the elapsed time spent by the whole pipeline to complete the creation of a KG; it is measured as the absolute wall-clock system time as reported by the time command of the Linux operating system. The experiments were run on an Intel(R) Xeon(R) CPU E5-2603 v3 @ 1.60GHz with 20 cores, 64 GB of memory, and Ubuntu 16.04 LTS.
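For completeness, a minimal sketch of an equivalent wall-clock measurement from Python is shown below; the pipeline command is a hypothetical placeholder, and the numbers reported in the paper were obtained with the Linux time command instead.

```python
# Sketch: measuring the wall-clock time of a KG creation pipeline run as a subprocess.
import subprocess
import time

def measure_pipeline(command):
    start = time.perf_counter()
    subprocess.run(command, check=True)        # e.g., the RML engine invocation
    return time.perf_counter() - start

# elapsed = measure_pipeline(["python3", "run_pipeline.py"])   # hypothetical entry point
# print(f"wall-clock time: {elapsed:.2f} s")
```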
Results. Figure 4 illustrates the performance of the two approaches to KG creation, i.e., Baseline, which performs EA as pre-processing, and EABlock, which enables the specification of EA as part of the RML mapping rules. As can be observed in Figure 4, independent of the applied RML-compliant engine, utilizing EABlock reduces the overall execution time of KG creation in all pipelines. Figure 4 demonstrates that performing EA as pre-processing is more expensive than using EABlock as part of the main pipeline of KG creation. It can also be observed that in the case of more complex mapping rules, the impact of EABlock in decreasing the execution time is even more considerable and significant.

Figure 4: Efficiency. The impact of using EABlock in KG creation pipelines applying two different RML-compliant engines: (a) the performance of a KG creation pipeline applying RocketRML, and (b) the performance of a KG creation pipeline applying SDM-RDFizer. Baseline corresponds to the execution of entity alignment in a pre-processing stage, while EABlock enables the specification of this process in the RML mapping rules. As observed, EABlock reduces the execution time of KG creation pipelines that involve entity alignment tasks in comparison to the application of the same functions during a pre-processing stage.

6.2 EABlock Effectiveness - RQ2
We define two pipelines, Baseline and EABlock; the Baseline pipeline includes no entity alignment task. The aim is to evaluate the connectivity in a KG created using the EABlock pipeline and assess RQ2.

Datasets and Mappings. We extract data related to drugs (11,293 records), the disorders for which the drugs are prescribed (416 records), and the interactions between the drugs (1,646,836 records) from DrugBank (https://go.drugbank.com/, version 5.1.8). We produce three mock datasets resembling normal clinical notes for cancer patients, including the data related to the comorbidities (1,322 records) and the prescribed oncological (1,764 records) and non-oncological drugs (1,325 records). We create a unified schema for these datasets and a set D4 of RML mapping rules to integrate them. Also, we create a set D5 with all the mapping rules in D4 plus the corresponding calls to the EABlock functions to execute entity alignment for drugs and disorders.

Analysis. Let KGb and KGeablock be the KGs created by the Baseline and EABlock pipelines, respectively. KGeablock comprises 10,339,870 RDF triples, while KGb has 10,200,209. KGb and KGeablock are used to create two labelled directed graphs, Gb = (V, Eb) and Geablock = (V, Eeablock), and traditional network analysis methods are applied to determine connectivity.
The vertices in V are the classes of KGb and KGeablock with at least one resource; KGb and KGeablock have the same resources and literals. A labelled directed edge e = (Q, p, K) belongs to Eb (resp., Eeablock) if there are classes Q and K in V, instances q and k of Q and K in KGb (resp., KGeablock), and the RDF triple (q, p, k) belongs to KGb (resp., KGeablock). Gb and Geablock thus provide an aggregated representation of KGb and KGeablock. Figure 5 depicts Gb and Geablock; Geablock is composed of 11 vertices and 39 directed edges, while Gb comprises 11 vertices connected by only 10 edges. Table 5c compares Gb and Geablock in terms of graph metrics generated by Cytoscape (https://cytoscape.org/).

Figure 5: Connectivity Analysis. (a) Directed labelled graph Gb; (b) directed labelled graph Geablock; (c) graph metrics for Gb and Geablock. The Baseline and EABlock pipelines generate KGb and KGeablock, respectively. KGb and KGeablock have the same classes and entities. However, KGb does not include the entity alignments to UMLS, Wikidata, and DBpedia added to KGeablock. Gb and Geablock are directed labelled graphs that provide an aggregated representation of KGb and KGeablock. The values of the graph metrics corroborate that connectivity is increased by the entity alignment performed by the EABlock pipeline.

(c) Graph metrics for Gb and Geablock:
Metric | EABlock (Geablock) | Baseline (Gb)
Number of nodes | 11 | 11
Number of edges | 39 | 10
Avg. number of neighbors | 3.091 | 1.273
Network diameter | 2 | 1
Clustering coefficient | 0.184 | 0.098
Network density | 0.155 | 0.064
Connected components | 1 | 6

Metrics. a. The average number of neighbors indicates the average connectivity of a vertex or node in a graph. b. The network diameter measures the shortest path that connects the two most distant nodes in a graph. c. The clustering coefficient measures the tendency of nodes that share the same connections in a graph to become connected. If a neighborhood is fully connected, the clustering coefficient is 1.0, while a value close to 0.0 means that there is no connection in the neighborhood. d. The network density measures the portion of potential edges in a graph that are actually edges; a value close to 1.0 indicates that the graph is fully connected. e. The number of connected components indicates the number of subgraphs composed of vertices connected by at least one path.

Results. The results reported in Figure 5c indicate that, in terms of the average number of neighbors, KGeablock comprises entities that are more connected. Moreover, the clustering coefficient is relatively low, but the CUI annotations and links to DBpedia and Wikidata included in KGeablock increase the connectivity in the neighborhoods of Geablock. In particular, eablock:Patient, eablock:DrugDisorder, eablock:Annotation, eablock:Disease, wiki:Q12136, wiki:Q11173, and dbo:Drug have a neighborhood connectivity of 8 in Geablock. On the other hand, in Gb, the neighborhood connectivity of eablock:Annotation, eablock:Disease, wiki:Q12136, wiki:Q11173, and dbo:Drug is 0, and that of eablock:Patient and eablock:DrugDisorderInteraction is 3. These results corroborate that connectivity is enhanced as a result of the entity alignment implemented by the EABlock functions.
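A minimal sketch of this class-level aggregation and of computing comparable metrics is shown below. It uses networkx rather than Cytoscape (the tool actually used in the paper), and the toy triples and class assignments are illustrative assumptions.

```python
# Sketch: aggregate instance-level triples into a class-level directed labelled graph and
# compute Cytoscape-like connectivity metrics with networkx (assumed substitute tool).
import networkx as nx

triples = [                                  # toy (subject, predicate, object) triples
    ("patient1", "hasDiagnosis", "disorder1"),
    ("disorder1", "annotatedWith", "C0020538"),
    ("C0020538", "sameAs", "Q41861"),
]
instance_class = {                           # toy rdf:type assertions
    "patient1": "eablock:Patient", "disorder1": "eablock:Disease",
    "C0020538": "eablock:Annotation", "Q41861": "wiki:Q12136",
}

def aggregate(triples, instance_class):
    graph = nx.DiGraph()
    graph.add_nodes_from(set(instance_class.values()))    # vertices: classes with >= 1 resource
    for s, p, o in triples:
        if s in instance_class and o in instance_class:
            graph.add_edge(instance_class[s], instance_class[o], label=p)
    return graph

g = aggregate(triples, instance_class)
u = g.to_undirected()
print("nodes:", g.number_of_nodes(), "edges:", g.number_of_edges())
print("avg. number of neighbors:", sum(d for _, d in u.degree()) / u.number_of_nodes())
print("clustering coefficient:", nx.average_clustering(u))
print("network density:", nx.density(g))
print("connected components:", nx.number_connected_components(u))
```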
6.3 EABlock Effectiveness - RQ3
We evaluate the performance of EABlock and the impact that data quality issues may have on entity alignment. We define 15 testbeds: five for DBpedia, five for Wikidata, and five for UMLS.

Datasets. Gold standard: We create the gold standard datasets by extracting biomedical instances from the DBpedia (06-2010) and Wikidata (04-2019) KGs, and the UMLS (November 2020) dataset. We obtain 51,850 records from Wikidata considering the classes "Q12136 (disease)", "Q3736076 (biological function)", "Q43229 (organization)", "Q514 (anatomy)", "Q79529 (chemical substance)", and "Q863908 (nucleic acid sequence)". From DBpedia, we extract 43,680 records related to the classes "ChemicalSubstance", "Disease", "Gene", and "PersonFunction". We consider the whole UMLS dataset, including 3,741,395 records. Nonetheless, to avoid a huge difference between the number of UMLS records and the number of records in the DBpedia and Wikidata experiments, we randomly select 1,496,557 records from the UMLS dataset. Meanwhile, to decrease the chance of possible biases, for each of the five experiments we generate a new randomly selected UMLS dataset with the same number of records, i.e., 1,496,557 (https://tib.eu/cloud/s/XJiqDDAHqM8Fw5K).

Testbeds: For each of DBpedia, Wikidata, and UMLS, five testbeds are generated by manipulating the gold standard datasets considering frequent quality issues that may exist in datasets: character capitalization, elimination, insertion, and replacement. All the created testbeds possess lower quality than the gold standard datasets due to the intentional misspelling errors that we generate in the values of the records. The errors include: a) capitalizing all the characters of a record value; b) randomly eliminating one character from the value; c) randomly replacing one character with another randomly selected character; and d) inserting a randomly selected character at a random location in the record value. Accordingly, each of the mentioned errors is introduced in 50% of the records of one of the testbeds, while the other 50% of the records carry the same values as the gold standard. The last testbed, created by including all four types of errors, has the lowest quality. In this dataset, each 20% out of 80% of the records involves exactly one of the four errors; therefore, only 20% of the records are error-free, in contrast to the other four testbeds, in which 50% of all records are free from errors.
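The error types above can be reproduced with a short script; the following sketch is an illustrative reconstruction of the testbed generation (the concrete sampling and implementation used in the paper may differ).

```python
# Sketch: inject one of the four misspelling error types into a fraction of record values.
import random
import string

def capitalize_all(value: str) -> str:
    return value.upper()

def eliminate_char(value: str) -> str:
    i = random.randrange(len(value))
    return value[:i] + value[i + 1:]

def replace_char(value: str) -> str:
    i = random.randrange(len(value))
    return value[:i] + random.choice(string.ascii_lowercase) + value[i + 1:]

def insert_char(value: str) -> str:
    i = random.randrange(len(value) + 1)
    return value[:i] + random.choice(string.ascii_lowercase) + value[i:]

ERRORS = [capitalize_all, eliminate_char, replace_char, insert_char]

def build_testbed(values, error, ratio=0.5, seed=42):
    # Corrupt `ratio` of the non-empty values with the given error; keep the rest unchanged.
    random.seed(seed)
    return [error(v) if v and random.random() < ratio else v for v in values]

# Example: build_testbed(["hypertension", "breast cancer", "aspirin"], eliminate_char)
```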
Metrics. We assess the sensitivity of EABlock in terms of I. precision: the fraction of correct entity alignment results among all the entity alignment results returned by EABlock; II. recall: the fraction of correct entity alignment results among all the results expected to be returned by EABlock; and III. F1 score: the harmonic mean of precision and recall.

Table 1: Effectiveness. The performance of EABlock is assessed in 15 datasets. Misspelling errors are added with five types of transformations. Entity alignment is performed by the keyword-based functions over the keywords generated by the five transformations. The aligned entities are compared with the resources of the original labels.

UMLS
Text Error Type | Precision | Recall | F1 Score
Capitalization of all characters | 1.0 | 0.97 | 0.99
Elimination of a character | 0.99 | 0.74 | 0.85
Replacement of a character | 0.99 | 0.75 | 0.85
Insertion of a new character | 0.99 | 0.50 | 0.66
Combination of all 4 errors | 1.0 | 0.78 | 0.88

DBpedia
Text Error Type | Precision | Recall | F1 Score
Capitalization of all characters | 0.78 | 0.78 | 0.78
Elimination of a character | 0.78 | 0.78 | 0.78
Replacement of a character | 0.98 | 0.98 | 0.98
Insertion of a new character | 0.78 | 0.78 | 0.78
Combination of all 4 errors | 0.78 | 0.78 | 0.78

Wikidata
Text Error Type | Precision | Recall | F1 Score
Capitalization of all characters | 0.99 | 0.99 | 0.99
Elimination of a character | 0.99 | 0.99 | 0.99
Replacement of a character | 0.99 | 0.99 | 0.99
Insertion of a new character | 0.99 | 0.99 | 0.99
Combination of all 4 errors | 0.99 | 0.99 | 0.99

Results. Table 1 demonstrates the results of running EABlock over the five configurations of error types explained in Testbeds. The entity alignment engine (Falcon) reports relatively high performance. Cleaning the background knowledge of Falcon and filtering it to contain only resources related to the biomedical domain, as explained in section 5, reduces the ambiguity among the resources in the background knowledge; this improves the performance of Falcon significantly and plays a major role in achieving such high performance. Also, having the input of the entity alignment module as keywords without any noise, e.g., stopwords, helps the module recognize and link the labels precisely. Moreover, Table 1 suggests that the used entity alignment module is able to overcome the proposed error types. There are records for which EABlock fails to return the expected linked entity based on the gold standard. These failures are mostly caused by data quality issues in the KGs. To clarify, we enumerate a couple of these examples with possible explanations: a) There are cases in which the linked entities retrieved by the entity alignment engine (Falcon) are correct, despite the fact that their identifiers differ from those available in the gold standard. For instance, for the keyword "malignant histiocytosis", EABlock retrieves "Q164952" (https://www.wikidata.org/wiki/Q164952) from Wikidata, while in the gold standard the Wikidata identifier for the same keyword appears to be "Q52962465" (http://www.wikidata.org/entity/Q52962465). However, both identifiers lead to the same entry on Wikidata; for some keywords, more than one identifier exists in such KGs. b) Another example of unexpectedly retrieved linked entities can be observed in the case of long combinations of keywords, such as "early infantile epileptic encephalopathy 19". In this case, EABlock links the first entity that can be recognized by the first couple of keywords, "Q61913448" (https://www.wikidata.org/wiki/Q61913448), which belongs to the label "early infantile epileptic encephalopathy 37". The same failure case can be observed in retrieval from DBpedia as well; for the keyword "Chronic leukemia", EABlock retrieves a link to "B-cell_chronic_lymphocytic_leukemia" (https://dbpedia.org/page/Chronic_lymphocytic_leukemia), while based on the gold standard the correct link is "Chronic_leukemia" (https://dbpedia.org/page/Chronic_leukemia). In addition, EABlock fails to recognize long labels generated from keywords with non-alphanumerical characters present in UMLS (e.g., n-((2r)-1-(((2r)-1-(((2r)-6-amino-1-(4-amino-4-carboxy-1-piperidinyl)-1-oxo-2-hexanyl)amino)-4-methyl-1-oxo-2-pentanyl)amino)-1-oxo-3-phenyl-2-propanyl)-d-phenylalaninamide).
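For reference, the reported precision, recall, and F1 score can be computed by comparing the links returned for a testbed against the gold standard, as in the minimal sketch below (toy data, not the paper's evaluation code).

```python
# Sketch: precision, recall, and F1 of entity alignment results against a gold standard.
def precision_recall_f1(predicted: dict, gold: dict):
    returned = {key: link for key, link in predicted.items() if link is not None}
    correct = sum(1 for key, link in returned.items() if gold.get(key) == link)
    precision = correct / len(returned) if returned else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: keyword -> linked entity; the corrupted keyword yields no link.
gold = {"hypertension": "Q41861", "breast cancer": "Q128581"}
predicted = {"hypertension": "Q41861", "breast cancer": None}
print(precision_recall_f1(predicted, gold))   # (1.0, 0.5, 0.666...)
```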
6.4 Discussion
As emphasized in the previous sections, EABlock can be utilized in any KG creation pipeline that applies an RML-compliant engine for the tasks of mapping rule translation and RDF triple generation. In spite of the fact that there is a W3C community group targeting the issues regarding declarative KG creation using the R2RML and RML mapping languages (https://www.w3.org/community/kg-construct/), there is still a lack of formal definitions, which is a general barrier decelerating the adoption of different KG creation workflows and engines. Additionally, the absence of formal definitions negatively impacts the extension of the available state of the art. Another aspect of being agnostic is the application of a tool that can perform the NER and EL tasks. Despite all the value this independence brings, the performance of EABlock is subject to change based on the choice of the entity alignment tool.

7 CONCLUSIONS AND FUTURE WORK
The diverse interoperability issues existing in textual data and the demand for a transparent, traceable, and efficient pipeline of KG creation led us to introduce EABlock. EABlock is an approach to solve entity alignment problems by capturing knowledge from existing KGs while keeping the procedure transparent and traceable. With an eager evaluation strategy and an efficient translation of mapping rules into function-free rules, EABlock ensures that efficiency is not sacrificed for reproducibility. The observed experimental results show the benefits of grounding solutions for KG creation in well-established problems like NER, EL, and data integration systems. Thus, EABlock broadens the repertoire of approaches for KG creation and provides the basis for developing real-world KGs. Our vision is that EABlock will be the starting point for the development of FnO functions which can be made available following the FAIR principles. In the future, we will empower EABlock with a set of functions to overcome interoperability problems across multilingual and unstructured datasets.

Acknowledgments. This work has been partially supported by the EU H2020 RIA funded projects: CLARIFY with grant agreement (GA) No. 875160, P4-LUCAT with GA No. 53000015, and PLATOON with GA No. 872592.

REFERENCES
[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. 2007. DBpedia: A Nucleus for a Web of Open Data. In Proceedings of ISWC + ASWC.
[2] O. Bodenreider. 2004. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res 1, 32 (2004).
[3] Cinzia Cappiello, Avigdor Gal, Matthias Jarke, and Jakob Rehof. 2020. Data ecosystems: sovereign data exchange among organizations (Dagstuhl Seminar 19391). In Dagstuhl Reports, Vol. 9. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
[4] David Chaves-Fraga, Kemele M. Endris, Enrique Iglesias, Oscar Corcho, and Maria-Esther Vidal. 2019. What are the Parameters that Affect the Construction of a Knowledge Graph?. In OTM Conferences.
[5] Souripriya Das, Seema Sundara, and Richard Cyganiak. 2012. R2RML: RDB to RDF Mapping Language, W3C Recommendation 27 September 2012. W3C (2012).
[6] Ben De Meester, Anastasia Dimou, Ruben Verborgh, and Erik Mannens. 2016. An ontology to semantically declare and describe functions. In European Semantic Web Conference. Springer, 46–49.
[7] Ben De Meester, Wouter Maroy, Anastasia Dimou, Ruben Verborgh, and Erik Mannens. 2017. Declarative data transformations for Linked Data generation: the case of DBpedia. In European Semantic Web Conference.
[8] Christophe Debruyne and Declan O'Sullivan. 2016. R2RML-F: Towards Sharing and Executing Domain Logic in R2RML Mappings. In LDOW Workshop.
[9] Anastasia Dimou, Gerald Haesendonck, Martin Vanbrabant, Laurens De Vocht, Ruben Verborgh, Steven Latré, and Erik Mannens. 2017. iLastic: Linked Data Generation Workflow and User Interface for iMinds Scholarly Data. In Semantics, Analytics, Visualization. Springer, 15–32.
[10] Anastasia Dimou, Miel Vander Sande, Pieter Colpaert, Ruben Verborgh, Erik Mannens, and Rik Van de Walle. 2014. RML: a generic language for integrated RDF mappings of heterogeneous data. In LDOW.
[11] Sandra Geisler, Maria-Esther Vidal, Cinzia Cappiello, Bernadette Farias Lóscio, Avigdor Gal, Matthias Jarke, Maurizio Lenzerini, Paolo Missier, Boris Otto, Elda Paja, Barbara Pernici, and Jakob Rehof. 2021. Knowledge-driven Data Ecosystems Towards Data Transparency. Accepted at the Special Issue on Data Transparency, ACM Journal of Data and Information Quality (2021).
[12] Claudio Gutiérrez and Juan F. Sequeda. 2021. Knowledge graphs. Commun. ACM 64, 3 (2021), 96–104.
[13] Enrique Iglesias, Samaneh Jozashoori, David Chaves-Fraga, Diego Collarana, and Maria-Esther Vidal. 2020. SDM-RDFizer: An RML Interpreter for the Efficient Creation of RDF Knowledge Graphs. In ACM International Conference on Information and Knowledge Management (CIKM).
[14] Ernesto Jiménez-Ruiz, Oktie Hassanzadeh, Vasilis Efthymiou, Jiaoyan Chen, Kavitha Srinivas, and Vincenzo Cutrona. 2020. Results of SemTab 2020. In CEUR Workshop Proceedings, Vol. 2775. 1–8.
[15] Samaneh Jozashoori, David Chaves-Fraga, Enrique Iglesias, Maria-Esther Vidal, and Oscar Corcho. 2020. FunMap: Efficient Execution of Functional Mappings for Knowledge Graph Creation. In International Semantic Web Conference.
[16] Ademar Crotti Junior, Christophe Debruyne, Rob Brennan, and Declan O'Sullivan. 2016. FunUL: a method to incorporate functions into uplift mapping languages. In International Conference on Information Integration and Web-based Applications and Services.
[17] Maurizio Lenzerini. 2002. Data integration: A theoretical perspective. In ACM Symposium on Principles of Database Systems.
[18] Franck Michel, Fabien Gandon, Valentin Ah-Kane, Anna Bobasheva, Elena Cabrio, Olivier Corby, Raphaël Gazzotti, Alain Giboin, Santiago Marro, Tobias Mayer, et al. 2020. Covid-on-the-Web: Knowledge graph and services to advance COVID-19 research. In International Semantic Web Conference. Springer, 294–310.
[19] Ahmad Sakor, Isaiah Onando Mulang, Kuldeep Singh, Saeedeh Shekarpour, Maria-Esther Vidal, Jens Lehmann, and Sören Auer. 2019. Old is Gold: Linguistic Driven Approach for Entity and Relation Linking of Short Text. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).
[20] Ahmad Sakor, Kuldeep Singh, Anery Patel, and Maria-Esther Vidal. 2020. Falcon 2.0: An Entity and Relation Linking Tool over Wikidata. In International Conference on Information and Knowledge Management (CIKM).
[21] Maria-Esther Vidal, Kemele M. Endris, Samaneh Jazashoori, Ahmad Sakor, and Ariam Rivas. 2019. Transforming heterogeneous data into knowledge for personalized treatments – A use case. Datenbank-Spektrum 19, 2 (2019).
[22] D. Vrandecic and M. Krötzsch. 2014. Wikidata: a free collaborative knowledgebase. Commun. ACM 57, 10 (2014), 78–85.
[23] Binh Vu, Jay Pujara, and Craig A. Knoblock. 2019. D-REPR: A Language for Describing and Mapping Diversely-Structured Data Sources to RDF. In International Conference on Knowledge Capture.
[24] Mark D. Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten, Luiz Bonino da Silva Santos, Philip E. Bourne, et al. 2016. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 3, 1 (2016), 1–9.
[25] Kaisheng Zeng, Chengjiang Li, Lei Hou, Juanzi Li, and Ling Feng. 2021. A comprehensive survey of entity alignment for knowledge graphs. AI Open (2021).