The linguistic and social impact of multiculturalism can no longer be neglected in any sector, creating an urgent need for systems and procedures to manage and share cultural heritage in both supranational and multi-literate contexts. To achieve this goal, text sensing appears to be one of the most crucial research areas. The long-term objective of the DigitalMaktaba project, born from an interdisciplinary collaboration among computer scientists, historians, librarians, engineers and linguists, is to establish procedures for the creation, management and cataloguing of archival heritage in non-Latin alphabets. In this paper, we discuss the ongoing design of an innovative workflow and tool in the area of text sensing, for the automatic extraction of knowledge and cataloguing of documents written in non-Latin languages (Arabic, Persian and Azerbaijani). The current prototype leverages different OCR, text processing and information extraction techniques ...
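The abstract names OCR, text processing and information extraction but gives no implementation details. Purely as an illustration, here is a minimal sketch of such a pipeline step, assuming the Tesseract engine through the real pytesseract wrapper and a hypothetical year-extraction rule; the prototype's actual engines and rules are not stated above.

```python
# Minimal sketch of an OCR + text-processing step for right-to-left scripts.
# Assumes Tesseract with Arabic/Persian language packs installed; the engines
# and extraction rules actually used by the DigitalMaktaba prototype may differ.
import re

from PIL import Image
import pytesseract

def extract_text(scan_path: str, lang: str = "ara+fas") -> str:
    """OCR a scanned page and collapse whitespace runs in the raw output."""
    raw = pytesseract.image_to_string(Image.open(scan_path), lang=lang)
    return re.sub(r"\s+", " ", raw).strip()

def extract_metadata(text: str) -> dict:
    """Toy information-extraction step: pull 4-digit year candidates (hypothetical rule)."""
    years = re.findall(r"\b1[0-9]{3}\b", text)
    return {"text": text, "candidate_years": years}

# Usage: extract_metadata(extract_text("page_001.png"))
```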
Ajith Abraham, Machine Intelligence Research Labs (MIR Labs), USA; Xiangdong An, York University, Canada; Maristella Agosti, University of Padua, Italy; Reda Alhajj, University of Calgary, Canada; Luis Antunes, Universidade de Lisboa, Portugal; Helen Ashman, University of South Australia, Australia; Ching Man Au Yeung, NTT Communication Science Laboratories, Japan; Leif Azzopardi, University of Glasgow, UK; Ebrahim Bagheri, National Research Council of Canada, Canada; Mert Bal, Yildiz Technical University, Turkey; Michel ...
In the last twenty years, many data integration systems following a classical wrapper/mediator architecture and providing a Global Virtual Schema (a.k.a. Global Virtual View - GVV) have been proposed by the research community. The main issues faced by these approaches range from system-level heterogeneities to structural and syntax-level heterogeneities, up to the semantic level. Despite the research effort, all the proposed approaches require substantial user intervention to customize and manage the data integration and reconciliation tasks. In some cases, the effort and complexity of the task are considerable, since they require the development of specific programming code. Unfortunately, due to the domain specificity they address, application code and solutions are rarely reusable in other domains. For this reason, the 2005 Lowell Report provided the guidelines for the definition of a public benchmark for the information integration problem. The proposal, called THALIA (Test Harness for the Assessment of Legacy information Integration Approaches), ...
International Conference on Information Society (i-Society 2011), 2011
The composition of multimedia presentations is a time- and resource-consuming task if not approached in a well-defined manner. This is particularly true when people having different roles and following different high-level directives collaborate in the authoring and assembly of a final product. For this reason, we adopt the Select, Assemble, Transform and Present (SATP) approach to coordinate presentation authoring, together with a tag cloud-based search engine that helps users efficiently retrieve useful assets. In this paper we present MediaPresenter, the framework we developed to support companies in the creation of multimedia communication means, providing an instrument that users can exploit every time new communication channels have to be created.
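As a rough illustration of tag-based asset retrieval of the kind mentioned, here is a minimal sketch with log-scaled tag weights and overlap ranking; the data model and scoring are hypothetical stand-ins, not MediaPresenter's actual design.

```python
# Toy tag-cloud weighting and tag-based asset search. Illustrative only;
# MediaPresenter's actual search engine is not described in the abstract.
from collections import Counter
from math import log

def tag_cloud(assets: list) -> dict:
    """Weight each tag by how often it annotates the collection (log-scaled)."""
    counts = Counter(tag for asset in assets for tag in asset["tags"])
    return {tag: 1 + log(n) for tag, n in counts.items()}

def search(assets: list, query_tags: set) -> list:
    """Rank assets by the number of query tags they match."""
    scored = [(len(query_tags & set(a["tags"])), a) for a in assets]
    return [a for score, a in sorted(scored, key=lambda p: -p[0]) if score > 0]

assets = [{"name": "intro.mp4", "tags": ["product", "launch"]},
          {"name": "logo.png", "tags": ["brand", "product"]}]
print(search(assets, {"product"}))  # both assets match the query tag
```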
We present SparkER, an ER tool that can scale practitioners' favorite ER algorithms. SparkER has been devised to take full advantage of parallel and distributed computation, running on top of Apache Spark. The first SparkER version focused on the blocking step and implements both the schema-agnostic and the Blast meta-blocking approaches (i.e., the state-of-the-art ones); a GUI was developed to let non-expert users run it in an unsupervised mode. The new version of SparkER shown in this demo significantly extends the tool: Entity Matching and Entity Clustering modules have been added. Moreover, in addition to the completely unsupervised mode of the first version, a supervised mode has been added, in which the user is assisted in supervising the entire process and in injecting their knowledge to achieve the best result. During the demonstration, attendees will be shown how SparkER can significantly help in devising and debugging ER algorithms.
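To make the blocking step concrete, here is a minimal sketch of schema-agnostic token blocking on Spark, the general idea SparkER's blocking builds on: every token of every attribute value becomes a blocking key. The data and code are illustrative; SparkER's implementation (and its Blast meta-blocking, which further prunes candidates) is more sophisticated.

```python
# Schema-agnostic token blocking sketch on Apache Spark (PySpark assumed installed).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("token-blocking-sketch").getOrCreate()
records = spark.sparkContext.parallelize([
    (1, "apple iphone 13 128gb"),
    (2, "iphone 13 apple smartphone"),
    (3, "samsung galaxy s21"),
])

# (token, record_id) pairs -> group by token -> keep blocks with >1 record
blocks = (records
          .flatMap(lambda r: [(tok, r[0]) for tok in set(r[1].split())])
          .groupByKey()
          .mapValues(list)
          .filter(lambda kv: len(kv[1]) > 1))

# e.g. ('iphone', [1, 2]): records 1 and 2 become a candidate comparison pair
print(blocks.collect())
```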
The Database Group (DBGroup, www.dbgroup.unimore.it) and the Information System Group (ISGroup, www.isgroup.unimore.it) research activities have been mainly devoted to the Data Integration research area. The DBGroup designed and developed the MOMIS data integration system, giving rise to a successful innovative enterprise, DataRiver (www.datariver.it), which distributes MOMIS as open source. MOMIS provides integrated access to structured and semi-structured data sources and allows a user to pose a single query and receive a single unified answer. Description Logics, automatic annotation of schemata, and clustering techniques constitute the theoretical framework. In the context of data integration, the ISGroup addressed problems related to the management and querying of heterogeneous data sources in large-scale and dynamic scenarios. The reference architectures are Peer Data Management Systems and their evolutions toward dataspaces. In these contexts, the ISGroup proposed and evaluated effective and efficient mechanisms for network creation with limited information loss, as well as solutions for mapping management, query reformulation and processing, and query routing. The main issues of data integration have been faced within European and national projects: automatic annotation, mapping discovery, global query processing, provenance, multidimensional information integration, and keyword search. With the new requirements of integrating linked open data, textual and multimedia data in a big data scenario, the research has been devoted to the Big Data Integration research area. In particular, the most relevant research results achieved are: a scalable entity resolution method, a scalable join operator, and LODEX, a tool for automatically extracting metadata from Linked Open Data (LOD) resources and for visual query formulation on LOD resources. Moreover, in collaboration with DataRiver, data integration was successfully applied to smart e-health.
Record-level matching rules are chains of similarity join predicates on multiple attributes, employed to join records that refer to the same real-world object when an explicit foreign key is not available on the data sets at hand. They are widely employed by data scientists and practitioners who work with data lakes, open data, and data in the wild. We present RulER, the first tool to efficiently execute record-level matching rules on parallel and distributed systems; we developed it on top of Apache Spark to leverage its vast ecosystem of libraries and wide adoption. In this demo, we show how RulER can be easily employed for data preparation tasks (i.e., to join data sets to be consumed by data analytic tasks) and to support the user in debugging record-level matching rules. Finally, we demonstrate how the execution strategy of record-level matching rules introduced by RulER is up to 3 times faster than the naïve concatenation of similarity join predicates.
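For intuition, a record-level matching rule can be read as a boolean combination of thresholded similarity predicates. The sketch below shows one such rule in plain Python; the attributes, similarity function and thresholds are hypothetical, and RulER's actual contribution is the distributed execution strategy for such rules on Spark, not the predicates themselves.

```python
# One record-level matching rule: title similarity AND brand similarity.
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lower-cased token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def match(r1: dict, r2: dict) -> bool:
    """Rule: titles highly similar AND brands overlapping (thresholds hypothetical)."""
    return (jaccard(r1["title"], r2["title"]) >= 0.8
            and jaccard(r1["brand"], r2["brand"]) >= 0.5)

r1 = {"title": "Canon EOS 2000D camera", "brand": "Canon"}
r2 = {"title": "canon eos 2000d Camera kit", "brand": "canon"}
print(match(r1, r2))  # True: the two records are joined as the same object
```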
We propose a novel Named Entity Recognition (NER) system based on a machine learning technique and a semantic network. The NER system is able to exploit the advantages of semantic information coming from Expert System's proprietary technology, Cogito. NER is a task of Natural Language Processing (NLP) which consists of detecting and classifying, in an unformatted text source, Named Entities (NE), i.e. real-world entities that can be denoted with a rigid designator. To address this problem, the chosen approach is a combination of machine learning and deep semantic processing. The machine learning method used is Conditional Random Fields (CRF). CRF is particularly suitable for the task because it analyzes an input sequence as a whole, instead of one item at a time. The CRF has been trained not only with classical features, available after simple computation or with little effort, but with semantic information too. Semantic information is obtained with Sensigrafo, ...
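The sketch below illustrates the general recipe of a CRF-based NER with an extra semantic feature slot, using the real sklearn-crfsuite library; since Cogito's semantic network is proprietary, a toy gazetteer stands in for the semantic information, and all features and labels are illustrative.

```python
# CRF-based NER sketch: classical surface features plus a semantic feature slot.
# Assumes sklearn-crfsuite is installed; the gazetteer is a hypothetical stand-in
# for the semantic information the paper derives from Cogito.
import sklearn_crfsuite

GAZETTEER = {"rome": "GPE", "acme": "ORG"}  # toy semantic resource

def token_features(sent: list, i: int) -> dict:
    w = sent[i]
    return {
        "lower": w.lower(),
        "is_title": w.istitle(),                      # classical surface feature
        "prev": sent[i - 1].lower() if i else "<BOS>",
        "semantic": GAZETTEER.get(w.lower(), "O"),    # semantic feature slot
    }

# One toy training sentence with BIO labels
X = [[token_features(s, i) for i in range(len(s))] for s in [["Acme", "hired", "Ana"]]]
y = [["B-ORG", "O", "B-PER"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))  # labels for the training sentence
```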
This paper reports the runner-up solution to the ACM SIGMOD 2020 programming contest, whose target was to identify the specifications (i.e., records) collected across 24 e-commerce data sources that refer to the same real-world entities. First, we investigate the machine learning (ML) approach, but surprisingly find that existing state-of-the-art ML-based methods fall short in such a context, not even reaching a 0.49 F-score. Then, we propose an efficient solution that exploits annotated lists and regular expressions generated by humans, and that reaches a 0.99 F-score. In our experience, our approach was not more expensive, in terms of human effort, than the labeling of match/non-match pairs required by ML-based methods.
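To convey the flavor of the approach, here is a minimal sketch of matching via an annotated list plus regular expressions that normalize product specifications into a comparable key; the list entries and patterns are hypothetical examples, not the contest solution's actual rules.

```python
# Normalize a product specification with an annotated brand list and regexes,
# then match records by comparing the resulting keys. Patterns are illustrative.
import re

BRANDS = {"canon", "nikon", "sony"}  # toy annotated list

def normalize(spec: str) -> tuple:
    s = spec.lower()
    brand = next((b for b in BRANDS if b in s), None)
    model = re.search(r"\b(eos\s*\d+d|d\d{3,4}|a\d{4})\b", s)  # hypothetical model regex
    mp = re.search(r"(\d+(?:\.\d+)?)\s*(?:mp|megapixel)", s)
    return (brand,
            model.group(1).replace(" ", "") if model else None,
            mp.group(1) if mp else None)

# Two records match when their normalized keys agree:
print(normalize("Canon EOS 2000D 24.1 MP DSLR"))          # ('canon', 'eos2000d', '24.1')
print(normalize("canon eos2000d camera 24.1 megapixel"))  # same key -> match
```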
During the last decades, the Data Warehouse has been one of the main components of a Decision Support System (DSS) inside a company. Given the great diffusion of Data Warehouses nowadays, managers have realized that there is great potential in combining information coming from multiple information sources, such as heterogeneous Data Warehouses from companies operating in the same sector. Existing solutions rely mostly on the Extract-Transform-Load (ETL) approach, a costly and complex process. The process of Data Warehouse integration can be greatly simplified by developing a method that is able to semi-automatically discover semantic relationships among the attributes of two or more different, heterogeneous Data Warehouse schemas. In this paper, we propose a method for the semi-automatic discovery of mappings between dimension hierarchies of heterogeneous Data Warehouses. Our approach exploits techniques from the Data Integration research area, combining topological properties of dimensions with semantic techniques.
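As a toy illustration of combining topological and semantic evidence when mapping dimension levels across two warehouse schemas, consider the sketch below; the level names, synonym set and weights are hypothetical, and the paper's actual method is considerably richer.

```python
# Map levels of two dimension hierarchies by mixing semantic similarity
# (toy synonym set + string similarity) with topological similarity (same depth).
from difflib import SequenceMatcher

dw1 = ["city", "region", "country"]   # hierarchy, finest -> coarsest
dw2 = ["town", "state", "nation"]
SYNONYMS = {("city", "town"), ("country", "nation")}  # toy semantic resource

def level_similarity(a: str, pos_a: int, b: str, pos_b: int) -> float:
    semantic = 1.0 if (a, b) in SYNONYMS else SequenceMatcher(None, a, b).ratio()
    topological = 1.0 if pos_a == pos_b else 0.0   # same depth in the hierarchy
    return 0.7 * semantic + 0.3 * topological      # weights are illustrative

for i, a in enumerate(dw1):
    best = max((level_similarity(a, i, b, j), b) for j, b in enumerate(dw2))
    print(a, "->", best[1], round(best[0], 2))     # e.g. city -> town 1.0
```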
Keyword-search systems for databases aim to answer a user query composed of a few terms with a ranked list of records. They are powerful and easy-to-use data exploration tools for a wide range of contexts. For instance, given a product database gathered by scraping e-commerce websites, these systems enable even non-technical users to explore the item set (e.g., to check whether it contains certain products or not, or to discover the price of an item). However, if the database contains dirty records (i.e., incomplete and duplicated records), a preprocessing step to clean the data is required. One fundamental data cleaning step is Entity Resolution, i.e., the task of identifying and fusing together all the records that refer to the same real-world entity. This task is typically executed on the whole data set, independently of: (i) the portion of the entities that a user may indicate through keywords, and (ii) the order priority that a user might express through an order-by clause. This paper ...
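The idea that motivates the paper, cleaning only the portion of the data a keyword query actually touches, can be sketched as follows; the blocking key, merge policy and data are hypothetical toys, not the paper's algorithm.

```python
# Filter records by keywords first, then run a toy Entity Resolution step only
# on the query-relevant portion instead of the whole database.
import re

records = [
    {"id": 1, "title": "iPhone 13 128GB", "price": 799},
    {"id": 2, "title": "Apple iPhone 13 (128 GB)", "price": 789},
    {"id": 3, "title": "Galaxy S21", "price": 699},
]

def keyword_filter(recs: list, keywords: list) -> list:
    """Keep records whose title contains at least one query term."""
    kws = {k.lower() for k in keywords}
    return [r for r in recs if kws & set(r["title"].lower().split())]

def resolve(recs: list) -> list:
    """Toy ER: group by the digits in the title, fuse duplicates by minimum price."""
    groups = {}
    for r in recs:
        key = "".join(re.findall(r"\d+", r["title"])) or r["title"].lower()
        groups.setdefault(key, []).append(r)
    return [{"title": g[0]["title"], "price": min(x["price"] for x in g)}
            for g in groups.values()]

hits = resolve(keyword_filter(records, ["iphone"]))  # clean only the queried portion
print(hits)  # one fused iPhone record instead of two duplicates
```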