Record-level matching rules are chains of similarity join predicates on multiple attributes, employed to join records that refer to the same real-world object when an explicit foreign key is not available in the data sets at hand. They are widely used by data scientists and practitioners who work with data lakes, open data, and data in the wild. In this work we present a novel technique to efficiently execute record-level matching rules on parallel and distributed systems and demonstrate its efficiency on a real-world data set.
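To make the notion concrete, here is a minimal Python sketch of what a record-level matching rule looks like: a conjunction of similarity predicates over attributes. The attribute names, the Jaccard measure, and the thresholds are illustrative assumptions, not the technique proposed in the paper.

```python
# A minimal sketch (not the paper's implementation) of a record-level
# matching rule: a chain of similarity join predicates over attributes.
# Attribute names and thresholds below are illustrative assumptions.

def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def matches(r1: dict, r2: dict) -> bool:
    """Chain of predicates: both must hold for the records to match."""
    return (jaccard(r1["title"], r2["title"]) >= 0.8
            and jaccard(r1["venue"], r2["venue"]) >= 0.5)

r1 = {"title": "Entity Resolution on Big Data", "venue": "VLDB 2018"}
r2 = {"title": "Entity resolution on big data", "venue": "VLDB"}
print(matches(r1, r2))  # True: the two records refer to the same object
```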
Entity Resolution and Data Fusion are fundamental tasks in a Data Integration process. Unfortunately, these tasks cannot be fully addressed by purely automated methods, so a "human-in-the-loop" approach, i.e., interaction with the Integration Designer, has to be considered. In fact, the application goal can be relevant to reducing the complexity and the cost of the whole integration process. Moreover, Entity Resolution and Data Fusion are often treated as consecutive tasks, independent of each other: the output of the first step is used as input of the second one. In this paper, we show why these tasks should not be considered independent. In fact, the evaluation of data fusion results is fundamental for the Integration Designer to analyze, and possibly revise, the choices made during the Entity Resolution process. To show this, our highly scalable Entity Resolution tool, SparkER, will be extended with high-quality post-processing methods for matching. These methods will be integrated into the MOMIS Data Fusion system, which will be extended as well with metrics for the evaluation of data fusion results.
Entity Resolution (ER) is a fundamental task of data integration: it identifies different representations (i.e., profiles) of the same real-world entity in databases. Comparing all possible profile pairs through an ER algorithm has quadratic complexity. Blocking is commonly employed to avoid that: profiles are grouped into blocks according to some features, and ER is performed only for entities of the same block. Yet, devising blocking criteria and ER algorithms for data with high schema heterogeneity is a difficult and error-prone task calling for automatic methods and debugging tools. In our previous work, we presented Blast, an ER system that can scale practitioners' favorite Entity Resolution algorithms. In its current version, Blast has been devised to take full advantage of parallel and distributed computation as well (running on top of Apache Spark). It implements the state-of-the-art unsupervised blocking method based on automatically extracted loose schema information. On top of Blast we built a GUI (Graphical User Interface), which allows: (i) visualizing, understanding, and (optionally) manually modifying the automatically extracted loose schema information (i.e., injecting the user's knowledge into the system); (ii) retrieving resolved entities through a free-text search box and visualizing the process that led to each result (i.e., the provenance). Experimental results on real-world datasets show that these two functionalities can significantly enhance Entity Resolution results.
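As a point of reference for the blocking step described above, the following sketch illustrates plain token blocking: profiles that share a blocking key end up in the same block, and only those pairs become comparison candidates. This is a deliberately simplified stand-in, not Blast's loose-schema-based method; the sample profiles are made up.

```python
# Illustrative sketch of the blocking idea (not Blast's actual method):
# profiles sharing a blocking key are grouped together, and pairwise
# comparisons happen only inside a block.
from collections import defaultdict
from itertools import combinations

profiles = [
    {"id": 1, "name": "Apple iPhone 13", "brand": "Apple"},
    {"id": 2, "name": "iPhone 13 128GB", "brand": "Apple"},
    {"id": 3, "name": "Galaxy S21",      "brand": "Samsung"},
]

blocks = defaultdict(list)
for p in profiles:
    for token in p["name"].lower().split():   # token blocking
        blocks[token].append(p["id"])

candidate_pairs = set()
for ids in blocks.values():
    candidate_pairs.update(combinations(sorted(ids), 2))

print(candidate_pairs)  # {(1, 2)}: only profiles sharing a token are compared
```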
The Italian Clinical network for FSHD (ICNF) has established the Italian National Registry for FSHD (INRF), collecting data from patients affected by facioscapulohumeral dystrophy (FSHD) and their relatives. The INRF has gathered data from molecular analysis, clinical evaluation, anamnestic information, and family history from more than 3500 participants. A data management framework, called the MOMIS FSHD Web Platform, has been developed to provide charts, maps, and search tools customized for specific needs. Patients' samples and their clinical information derive from the Italian Clinical network for FSHD (ICNF), a consortium consisting of fourteen neuromuscular clinics distributed across Italy. The tools used to collect, integrate, and visualize clinical, molecular, and natural history information about patients affected by FSHD and their relatives are described. The INRF has collected molecular data from FSHD testing conducted on 7197 subjects and identified 3362 individuals carrying a DRA: 1634 are unrelated individuals and 602 isolated cases. In 1032 cases the molecular testing has been extended to 3747 relatives, 1728 of whom carry a DRA. Since 2009, molecular analysis has been accompanied by clinical evaluation based on standardized evaluation protocols. In total, 3577 clinical forms have been collected; 2059 follow the Comprehensive Clinical Evaluation Form (CCEF). The integration of standardized clinical information and molecular data has made it possible to demonstrate the wide phenotypic variability of FSHD. The MOMIS (Mediator Environment for Multiple Information Sources) data integration framework allowed performing genotype-phenotype correlation studies and generated information of medical importance for clinical practice as well as genetic counseling. The platform implemented for the FSHD Registry data collection, based on OpenClinica, meets the requirement to integrate patient/disease information, as well as the need to adapt dynamically to security and privacy concerns. Our results indicate that the quality of data collection in a multi-integrated approach is fundamental for clinical and epidemiological research in a rare disease and allows diagnostic criteria and disease markers for FSHD to be redefined.
Peer-to-peer (P2P) computing has attracted significant media attention, initially spurred by the popularity of file-sharing systems such as Napster, Gnutella, and Morpheus. More recently, systems like BitTorrent and eDonkey have continued to sustain that attention. New techniques such as distributed hash tables (DHTs), semantic routing, and Plaxton meshes are being combined with traditional concepts such as hypercubes, trust metrics, and caching techniques to pool together the untapped computing power at the "edges" of the Internet. These new techniques and possibilities have generated a lot of interest in many industrial organizations, and resulted in the creation of a P2P working group on standardization in this area (http://www.irtf.org/charter?gtype=rg&group=p2prg).

In P2P computing, peers and services forego central coordination and dynamically organize themselves to support knowledge sharing and collaboration, in both cooperative and non-cooperative environments. The success of P2P systems strongly depends on a number of factors. First, the ability to ensure equitable distribution of content and services. Economic and business models which rely on incentive mechanisms to supply contributions to the system are being developed, along with methods for controlling the "free riding" issue. Second, the ability to enforce provision of trusted services. Reputation-based P2P trust management models are becoming a focus of the research community as a viable solution. The trust models must balance both the constraints imposed by the environment (e.g., scalability) and the unique properties of trust as a social and psychological phenomenon. Recently, we are also witnessing a move of the P2P paradigm to embrace mobile computing in an attempt to achieve even higher ubiquity. The possibility of services related to physical location and the relation with agents in physical proximity could introduce new opportunities and also new technical challenges.

Although researchers working on distributed computing, multi-agent systems, databases, and networks have been using similar concepts for a long time, it is only fairly recently that papers motivated by the current P2P paradigm have started appearing in high-quality conferences and workshops. Research in agent systems in particular appears to be most relevant because, since their inception, multi-agent systems have always been thought of as collections of peers. The multi-agent paradigm can thus be superimposed on the P2P architecture, where agents embody the description of the task environments, the decision-support capabilities, the collective behavior, and the interaction protocols of each peer. The emphasis in this context on decentralization, user autonomy, dynamic growth, and other advantages of P2P also leads to significant potential problems. Most prominent among these problems are coordination, the ability of an agent to make decisions on its own actions in the context of the activities of other agents, and scalability, since the value of P2P systems lies in how well they scale along several dimensions, including complexity, heterogeneity of peers, robustness, traffic redistribution, and so forth. It is important to scale up coordination strategies along multiple dimensions to enhance their tractability and viability, and thereby to widen potential application domains. These two problems are common to many large-scale applications. Without coordination, agents may be wasting their efforts, squandering resources, and failing to achieve their objectives in situations requiring collective effort.

This workshop series brings together researchers working on agent systems and P2P computing with the intention of strengthening this connection. Researchers from other related areas such as distributed systems, networks, and database systems are also welcome (and, in our opinion, have a lot to contribute). We sought high-quality and original contributions on the general theme of "Agents and P2P Computing".
In this paper we present the MOMIS Dashboard, an interactive data analytics tool built to explore and visualize the contents of varied data sources through several dynamic views (e.g., maps, charts, pies). The software tool is very versatile and supports connections to the major relational DBMSs and Big Data sources. Moreover, it can be connected to MOMIS, a powerful Open Source Data Integration system capable of integrating heterogeneous data sources such as enterprise information systems as well as sensor data. The MOMIS Dashboard provides secure permission management to limit data access on the basis of user roles, and a Designer to create and share personalized insights into company KPIs, thereby facilitating enterprise collaboration. We illustrate the MOMIS Dashboard's efficacy in a real enterprise scenario: a production monitoring platform that analyzes real-time and historical data collected through sensors located on machines, detects early signs of malfunctions or failures, and sends alarm notifications to the supervisors, thereby optimizing both production and energy consumption while ensuring preventive maintenance.
Entity Resolution is a crucial task for many applications, but its naive solution has low efficiency due to its quadratic complexity. Usually, to reduce this complexity, blocking is employed to cluster similar entities and thus reduce the global number of comparisons. The Meta-Blocking (MB) approach aims to restructure the block collection in order to reduce the number of comparisons, obtaining better results in terms of execution time. However, these techniques alone are not sufficient in the context of Big Data, where the records to be compared are typically in the order of hundreds of millions. Parallel implementations of MB have been proposed in the literature, but all of them are built on Hadoop MapReduce, which is known to have low efficiency on modern cluster architectures. We implement a Meta-Blocking technique for Apache Spark. Unlike Hadoop, Apache Spark uses a different paradigm to manage the tasks: it does not need to save partial results on disk, keeping them in memory, which guarantees a shorter execution time. We reimplemented the state-of-the-art MB techniques, creating a new algorithm that exploits the Spark architecture. We tested our algorithm on several established datasets, showing that our Spark implementation outperforms the existing Hadoop-based ones.
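The following PySpark fragment sketches the core meta-blocking intuition in the in-memory style the abstract refers to: weight each candidate pair by the number of blocks it shares and prune low-weight pairs. It is a simplification for illustration only; the block contents and the pruning threshold are assumptions, and the paper's algorithm is more elaborate.

```python
# A minimal PySpark sketch of the meta-blocking intuition (a simplification
# of the techniques discussed above): weight each candidate pair by the
# number of blocks it shares, then prune pairs below a threshold.
from pyspark.sql import SparkSession
from itertools import combinations

spark = SparkSession.builder.appName("meta-blocking-sketch").getOrCreate()
sc = spark.sparkContext

# Each block is a list of profile ids placed together by some blocking key.
blocks = sc.parallelize([[1, 2, 3], [1, 2], [2, 3, 4]])

pair_weights = (blocks
    .flatMap(lambda ids: combinations(sorted(ids), 2))  # pairs inside a block
    .map(lambda pair: (pair, 1))
    .reduceByKey(lambda a, b: a + b))                    # shared-block count

retained = pair_weights.filter(lambda kv: kv[1] >= 2)    # prune weak edges
print(retained.collect())  # pairs (1, 2) and (2, 3) survive with weight 2
spark.stop()
```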
Pushed by recent developments in the fields of declarative networking and data-parallel computation, we propose a first investigation of a declarative-imperative parallel programming model that tries to combine the two worlds. We identify a set of requirements that the model should possess and introduce a conceptual sketch of the system implementing the foreseen model.
Duplicate detection aims to identify different records in data sources that refer to the same real-world entity. It is a fundamental task for item catalog fusion, customer database integration, fraud detection, and more. In this work we present BigDedup, a toolkit able to detect duplicate records on Big Data sources in an efficient manner. BigDedup makes the state-of-the-art duplicate detection techniques available on Apache Spark, a modern framework for distributed computing in Big Data scenarios. It can be used in two different ways: (i) through a simple graphical interface that permits the user to process structured and unstructured data in a fast and effective way; (ii) as a library that provides different components that can be easily extended and customized. In the paper we show how to use BigDedup and demonstrate its usefulness through some industrial examples.
Electronic commerce lets people purchase goods and exchange information on business transactions on-line. Therefore, one of the main challenges for the designers of e-commerce infrastructures is information sharing: retrieving data located in different sources to obtain an integrated view that overcomes any contradiction or redundancy. Virtual Catalogs embody this approach, as they are conceived as instruments to dynamically retrieve information from multiple catalogs and present product data in a unified manner, without directly storing product data from the catalogs. In this paper we propose SI-Designer, a support tool for the integration of data from structured and semi-structured data sources, developed within the MOMIS (Mediator environment for Multiple Information Sources) project.
Entity Resolution (ER) aims to identify and merge records that refer to the same real-world entity. ER is typically employed as an expensive cleaning step on the entire data before consuming it. Yet, determining which entities are useful once cleaned depends solely on the user's application, which may need only a fraction of them. For instance, when dealing with Web data, we would like to be able to filter the entities of interest gathered from multiple sources without cleaning the entire, continuously growing data. Similarly, when querying data lakes, we want to transform data on-demand and return the results in a timely manner, a fundamental requirement of ELT (Extract-Load-Transform) pipelines. We propose BrewER, a framework to evaluate SQL SP queries on dirty data while progressively returning results as if they were issued on cleaned data. BrewER tries to focus the cleaning effort on one entity at a time, following an ORDER BY predicate. Thus, it inherently supports top-k and stop-and-resume execution. For a wide range of applications, a significant amount of resources can be saved. We exhaustively evaluate and show the efficacy of BrewER on four real-world datasets.
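To illustrate the progressive, one-entity-at-a-time idea (not BrewER's actual algorithm), the sketch below cleans candidate duplicate groups in the order induced by the ORDER BY attribute and emits each resolved entity as soon as it is ready, so a top-k consumer can stop early. The grouping, the merge policy, and the sample records are assumptions.

```python
# A simplified sketch of the progressive idea described above: clean one
# candidate group of duplicates at a time, following the ORDER BY attribute,
# and emit each resolved entity as soon as it is ready.

def resolve(group):
    """Merge duplicates into one entity (here: keep the lowest price)."""
    return {"name": group[0]["name"], "price": min(r["price"] for r in group)}

def progressive_query(groups, order_by="price"):
    # Order groups by an optimistic ordering key, then clean them lazily.
    seeds = sorted(groups, key=lambda g: min(r[order_by] for r in g))
    for group in seeds:
        yield resolve(group)   # emitted before later groups are cleaned

dirty = [
    [{"name": "iPhone 13", "price": 700}, {"name": "iphone 13", "price": 690}],
    [{"name": "Galaxy S21", "price": 650}],
]
for entity in progressive_query(dirty):
    print(entity)   # results stream out as if the whole data were cleaned
```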
The Linked Data Principles defined by Tim Berners-Lee promise that a large portion of Web Data will be usable as one big interlinked RDF database. Today, with more than one thousand Linked Open Data (LOD) sources available on the Web, we are witnessing an emerging trend in the publication and consumption of LOD datasets. However, the pervasive use of external resources, together with a deficiency in the definition of the internal structure of a dataset, makes many LOD sources extremely complex to understand. In this paper, we describe a formal method to unveil the implicit structure of a LOD dataset by building a (Clustered) Schema Summary. The Schema Summary contains all the main classes and properties used within the dataset, whether they are taken from external vocabularies or not, and can be conceived as an RDFS ontology. The Clustered Schema Summary, suitable for large LOD datasets, provides a higher-level view of the classes and the properties used, by gathering together classes that are the object of multiple instantiations.
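As a rough illustration of the raw ingredients behind a Schema Summary, the snippet below uses rdflib (assumed available) to count how often each class and property occurs in a small RDF graph; the paper's method builds a considerably richer, clustered summary on top of such statistics. The sample Turtle data is invented.

```python
# A minimal sketch of gathering class and property usage statistics,
# the raw material of a Schema Summary (the full method is richer).
from collections import Counter
from rdflib import Graph, RDF

data = """
@prefix ex: <http://example.org/> .
ex:alice a ex:Person ; ex:name "Alice" ; ex:knows ex:bob .
ex:bob   a ex:Person ; ex:name "Bob" .
"""

g = Graph()
g.parse(data=data, format="turtle")

class_usage = Counter(o for o in g.objects(predicate=RDF.type))
property_usage = Counter(p for _, p, _ in g if p != RDF.type)

print(class_usage)     # ex:Person used by 2 instances
print(property_usage)  # ex:name appears twice, ex:knows once
```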