Laure Berti-Équille is a Full Professor of Computer Science at Aix-Marseille University (AMU), Polytech, LIF (CNRS) in Marseille, France. Before joining AMU, she was a research director (DR2) at IRD, the French Institute of Research for Development (2011-2017). She was a Senior Scientist at the Qatar Computing Research Institute (QCRI/HBKU) in Qatar (2014-2017), a tenured Associate Professor at the University of Rennes 1 in France (2000-2010), and a two-year visiting researcher at AT&T Labs Research in NJ, USA (2007-2009). Her interests are at the intersection of large-scale data science, data analytics, and machine learning, with a focus on data quality research. She is the author or co-author of more than 70 publications in international journals and conference or workshop proceedings, and of two books. She initiated the first editions of the workshops on information and data quality in information systems (IQIS) and quality in databases (QDB), held in conjunction with SIGMOD and VLDB respectively, as well as the first French workshop on Data and Knowledge Quality, held in conjunction with EGC in France. Laure has served on the editorial board of the ACM Journal of Data and Information Quality. She was PC chair of the International Conference on Information Quality (ICIQ) in 2012 and 2016. Address: Parc Scientifique et Technologique de Luminy, 163 avenue de Luminy, Case 901, F-13288 Marseille Cedex 9, France.
As data types and data structures change to keep up with evolving technologies and applications, data quality problems too have evolved and become more complex. Data streams, web logs, wikis, biomedical applications, video streams, and social networking websites generate a mind-boggling variety of data types. Data quality mining, the use of data mining to manage, measure, and improve data quality, has focused mostly on addressing each category of data glitch separately, as a static entity. In this tutorial we highlight new directions in data quality mining, particularly: (a) the applicability and effectiveness of the methodologies for various data types such as structured, semi-structured, and stream data; (b) the detection of concomitant data glitches, like the occurrence of outliers in data with missing values and duplicates; and (c) the design of sequential approaches to data quality mining, such as workflows composed of a sequence of tasks for data quality exploration.
In this paper, we study the explainability of automated data cleaning pipelines and propose CLeanEX, a solution that can generate explanations for the pipelines automatically selected by an automated cleaning system, provided that the system can expose its cleaning pipeline search space. We propose meaningful explanatory features that are used to describe the pipelines and to generate predicate-based explanation rules. We compute quality indicators for these explanations and propose a multi-objective optimization algorithm to select the optimal set of explanations for user-defined objectives. Preliminary experiments show the need for multi-objective optimization for the generation of high-quality explanations, which can be either intrinsic to the single selected cleaning pipeline or relative to the other pipelines that were not selected by the automated cleaning system. We also show that CLeanEX is a promising step towards automatically generating insightful explanations, while catering to…
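The abstract above mentions selecting an optimal set of explanations under several user-defined objectives. The following is only a minimal sketch of one standard way to perform such multi-objective selection (keeping the Pareto-optimal candidates); the objective names and scores are invented, and this is not the CLeanEX algorithm itself.

```python
# Hypothetical sketch of multi-objective selection of explanation rules,
# not the CLeanEX algorithm: each candidate explanation is scored on several
# (assumed) quality indicators, and only non-dominated candidates are kept.

def dominates(a, b):
    """True if score vector `a` is >= `b` on all objectives and > on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(candidates):
    """Return the candidates whose score vectors are not dominated by any other."""
    front = []
    for name, scores in candidates.items():
        if not any(dominates(other, scores)
                   for other_name, other in candidates.items() if other_name != name):
            front.append(name)
    return front

# Illustrative scores (conciseness, coverage, specificity) -- made up for the example.
explanations = {
    "rule_1": (0.9, 0.4, 0.7),
    "rule_2": (0.6, 0.8, 0.6),
    "rule_3": (0.5, 0.3, 0.5),  # dominated by rule_2
}
print(pareto_front(explanations))  # ['rule_1', 'rule_2']
```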
In this demo, we present ADVISU (Anomaly and Dependency VISUalization), a powerful interactive system for visual analytics over massive datasets. ADVISU efficiently computes different types of dependencies (FDs, CFDs) and detects data anomalies in large databases, with up to several thousand attributes and millions of records. Real-time and scalable computational methods have been implemented in ADVISU to ensure interactivity, and the demonstration is intended to show how these methods scale up for real-world massive scientific datasets in astrophysical and oceanographic application domains. ADVISU provides users with informative and interactive graphical interfaces for visualizing data dependencies and anomalies. It enables the analysis to be refined interactively, recomputing the dependencies and anomalies in user-selected subspaces with good performance.
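As a concrete illustration of the dependencies mentioned above, here is a minimal sketch of an exact functional dependency check on tabular data, assuming pandas; it applies the FD definition naively and is not ADVISU's scalable algorithm.

```python
# Minimal sketch of the kind of dependency check a system like ADVISU performs,
# assuming the data sits in a pandas DataFrame.
import pandas as pd

def fd_holds(df, lhs, rhs):
    """Check whether the functional dependency lhs -> rhs holds exactly on df."""
    return (df.groupby(list(lhs))[rhs].nunique() <= 1).all()

df = pd.DataFrame({
    "zip":  ["13288", "13288", "75001"],
    "city": ["Marseille", "Marseille", "Paris"],
})
print(fd_holds(df, ["zip"], "city"))  # True: zip -> city holds on this sample
```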
Many data management applications, such as setting up Web portals, managing enterprise data, managing community data, and sharing scientific data, require integrating data from multiple sources. Each of these sources provides a set of values, and different sources can often provide conflicting values. To present quality data to users, it is critical to resolve conflicts and discover values that reflect the real world; this task is called data fusion. This paper describes a novel approach that finds true values from conflicting information when there are a large number of sources, among which some may copy from others. We present a case study on real-world data showing that the described algorithm can significantly improve the accuracy of truth discovery and is scalable when there are a large number of data sources.
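To make the truth discovery task concrete, below is a simplified sketch of an iterative, accuracy-weighted voting scheme; it ignores the copying behavior between sources that is central to the paper's approach, and all sources, objects, and values are toy examples.

```python
# Simplified truth-discovery sketch (not the paper's copy-aware algorithm):
# source weights and value choices are refined iteratively, starting from a plain vote.
from collections import defaultdict

claims = {  # object -> {source: claimed value}; toy data for illustration
    "capital_of_France": {"s1": "Paris", "s2": "Paris", "s3": "Lyon"},
    "capital_of_Italy":  {"s1": "Rome",  "s2": "Milan", "s3": "Rome"},
}
weights = {s: 1.0 for s in {"s1", "s2", "s3"}}

for _ in range(10):
    # 1. Estimate the true value of each object by weighted voting.
    truths = {}
    for obj, votes in claims.items():
        scores = defaultdict(float)
        for src, val in votes.items():
            scores[val] += weights[src]
        truths[obj] = max(scores, key=scores.get)
    # 2. Re-estimate each source's weight as its agreement with the current truths.
    for src in weights:
        agree = [truths[o] == v[src] for o, v in claims.items() if src in v]
        weights[src] = sum(agree) / len(agree)

print(truths)   # {'capital_of_France': 'Paris', 'capital_of_Italy': 'Rome'}
print(weights)  # s1 ends up with the highest weight
```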
This demo proposes MeSQuaL, a system for profiling and checking data quality before further tasks, such as data analytics and machine learning. MeSQuaL extends SQL for querying relational data with constraints on data quality and facilitates the verification of statistical tests. The system includes: (1) a query interpreter for SQuaL, the SQL-extended language we propose for declaring and querying data with data quality checks and statistical tests; (2) an extensible library of user-defined functions for profiling the data and computing various data quality indicators; and (3) a user interface for declaring data quality constraints, profiling data, monitoring data quality with SQuaL queries, and visualizing the results via data quality dashboards. We showcase our system in action with various scenarios on real-world datasets and show its usability for monitoring data quality over time and checking the quality of data on demand.
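The SQuaL syntax itself is not given in the abstract, so the sketch below only illustrates, in plain Python with pandas, the kind of quality indicators (completeness, key uniqueness) that such profiling functions might compute and check against a declared threshold; the column names and threshold are invented.

```python
# Sketch of quality indicators that MeSQuaL-style profiling UDFs might compute;
# this is an assumption for illustration, not the system's actual implementation.
import pandas as pd

def quality_indicators(df, key=None):
    """Profile a table: completeness per column and optional key uniqueness."""
    report = {f"completeness({c})": 1.0 - df[c].isna().mean() for c in df.columns}
    if key is not None:
        report[f"uniqueness({key})"] = df[key].nunique() / len(df)
    return report

df = pd.DataFrame({"id": [1, 2, 2, 4], "name": ["a", None, "c", "d"]})
report = quality_indicators(df, key="id")
print(report)  # completeness(name) = 0.75, uniqueness(id) = 0.75, ...

# A declarative-style check, here expressed as an ordinary condition.
if report["uniqueness(id)"] < 0.9:
    print("quality constraint violated: duplicate ids")
```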
Current work on knowledge discovery in databases (KDD) focuses on finding interesting rules whose interestingness or exceptional character one would like to qualify, but whose validity obviously depends on the validity of the data. Upstream of the KDD process, it therefore seems essential to assess the quality of the data stored in databases and data warehouses in order to: (1) provide users with a critical assessment of the quality of a system's content, (2) guide knowledge extraction according to a target profile of users and decision makers, (3) allow them to put into perspective the confidence they may place in the data and in the extracted rules, and thus to better adapt their use of them, and (4) finally, ensure the validity and interest of the knowledge extracted from the data. This article surveys the state of the art in the domain…
In this paper, we study the problem of discovering join FDs, i.e., functional dependencies (FDs) that hold on multiple joined tables. We leverage logical inference, selective mining, and sampling, and show that we can discover most of the exact join FDs from the single tables participating in the join and avoid the full computation of the join result. We propose algorithms to speed up the join FD discovery process and mine FDs on the fly, only from the necessary data partitions. We introduce JEDI (Join dEpendency DIscovery), our solution to discover join FDs without computing the full join beforehand. Our experiments on a range of real-world and synthetic data demonstrate the benefits of our method over existing FD discovery methods that need to precompute the join results before discovering the FDs. We show that the performance depends on the cardinalities and coverage of the join attribute values: for join operations with low coverage, JEDI with selective mining outperforms the com…
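One piece of logical inference the abstract alludes to can be illustrated simply: an FD that holds on one of the base tables also holds on an equi-join in which that table participates, since the join only replicates or drops its tuples. The sketch below checks this on a toy example with pandas; it is an illustration of that property, not the JEDI algorithm.

```python
# Illustrative sketch (not JEDI): FDs verified on single tables carry over to the
# equi-join; the materialized join is computed here only as a sanity check, which
# a JEDI-style approach avoids at scale.
import pandas as pd

def fd_holds(df, lhs, rhs):
    return (df.groupby(list(lhs))[rhs].nunique(dropna=False) <= 1).all()

orders = pd.DataFrame({"order_id": [1, 2, 3], "cust_id": [10, 10, 20], "amount": [5, 7, 9]})
customers = pd.DataFrame({"cust_id": [10, 20], "country": ["FR", "QA"]})

# FDs checked on the single tables only.
print(fd_holds(orders, ["order_id"], "amount"))     # True
print(fd_holds(customers, ["cust_id"], "country"))  # True

# Sanity check on the materialized join.
joined = orders.merge(customers, on="cust_id")
print(fd_holds(joined, ["order_id"], "amount"))     # still True
print(fd_holds(joined, ["cust_id"], "country"))     # still True
```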
In this paper, we propose to query XML documents with a quality-based recommendation of the results. Document quality is modeled as a set of (criterion, value) pairs collected in metadata sets and associated with the indexed XML documents. We implemented four basic operations to achieve quality recommendation: 1) annotation with metadata describing the documents' quality, 2) indexing the documents, 3) matching queries and quality requirements, and 4) viewing the recommended parts of the documents. The quality requirements of each user are kept as individual quality profiles (called XPS files). Every XML document in the document database refers to a quality style sheet (called an XQS file), which allows the specification of several matching strategies with rules that associate parts (sub-trees) of XML documents with user-profile quality requirements. An algorithm is described for evaluating the quality style sheets and user profiles in order to build an "adaptive quality …
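A hypothetical sketch of the matching step described above: each document carries (criterion, value) quality metadata, and a user profile (playing the role of the XPS files) states minimum acceptable values. The criteria names, scores, and thresholds here are invented for illustration.

```python
# Hypothetical matching of document quality metadata against a user quality profile.
documents = {
    "doc1.xml": {"freshness": 0.9, "completeness": 0.7},
    "doc2.xml": {"freshness": 0.4, "completeness": 0.95},
}
user_profile = {"freshness": 0.8, "completeness": 0.6}  # minimum acceptable values

def recommend(docs, profile):
    """Return documents whose quality metadata meet every requirement in the profile."""
    return [name for name, meta in docs.items()
            if all(meta.get(criterion, 0.0) >= threshold
                   for criterion, threshold in profile.items())]

print(recommend(documents, user_profile))  # ['doc1.xml']
```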
A fundamental problem in data fusion is to determine the veracity of multi-source data in order to resolve conflicts. While previous work in truth discovery has proved to be useful in practice for specific settings, source behaviors, or dataset characteristics, there has been limited systematic comparison of the competing methods in terms of efficiency, usability, and repeatability. We remedy this deficit by providing a comprehensive review of 12 state-of-the-art algorithms for truth discovery. We provide reference implementations and an in-depth evaluation of the methods based on extensive experiments on synthetic and real-world data. We analyze aspects of the problem that have not been explicitly studied before, such as the impact of initialization and parameter setting, convergence, and scalability. We provide an experimental framework for extensively comparing the methods in a wide range of truth discovery scenarios where source coverage, numbers and distributions of conflicts…
2004 IEEE International Conference on Multimedia and Expo (ICME) (IEEE Cat. No.04TH8763)
The administration of very large collections of images accentuates the classical problems of indexing and efficiently querying information. This paper describes a new method, applied to very large still-image databases, that combines two data mining techniques, clustering and association rule mining, in order to better organize image collections and to improve the performance of queries. The objective of our…
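A loose sketch of how the two techniques named above can be combined (not necessarily the paper's exact pipeline): images are clustered on toy feature vectors with k-means, and simple keyword-to-cluster association rules are then derived from co-occurrence counts. The feature vectors and keywords are invented for illustration.

```python
# Loose sketch combining clustering and association rule mining on toy image data.
from collections import Counter
import numpy as np
from sklearn.cluster import KMeans

features = np.array([[0.1, 0.2], [0.15, 0.22], [0.9, 0.8], [0.88, 0.82]])  # toy descriptors
keywords = [{"sea"}, {"sea", "boat"}, {"desert"}, {"desert", "dune"}]

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

# Count how often each keyword co-occurs with each cluster.
support = Counter()
for kw_set, label in zip(keywords, labels):
    for kw in kw_set:
        support[(kw, label)] += 1

# Rule "keyword -> cluster" with confidence = support(kw, cluster) / support(kw).
kw_total = Counter(kw for kw_set in keywords for kw in kw_set)
for (kw, label), cnt in support.items():
    print(f"{kw} -> cluster {label}  (confidence {cnt / kw_total[kw]:.2f})")
```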
On the Web, a massive amount of user-generated content is available through various channels (e.g., texts, tweets, Web tables, databases, multimedia-sharing platforms, etc.). Conflicting information, rumors, and erroneous and fake content can easily spread across multiple sources, making it hard to distinguish between what is true and what is not. This book gives an overview of fundamental issues and recent contributions for ascertaining the veracity of data in the era of Big Data. The text is organized into six chapters, focusing on structured data extracted from texts. Chapter 1 introduces the problem of ascertaining the veracity of data in a multi-source and evolving context. Issues related to information extraction are presented in Chapter 2. Current truth discovery computation algorithms are presented in detail in Chapter 3, followed by practical techniques for evaluating data source reputation and authoritativeness in Chapter 4. The theoretical foundations and various approaches for modeling the diffusion of misinformation in networked systems are studied in Chapter 5. Finally, truth discovery computation from extracted data in a dynamic context of misinformation propagation raises interesting challenges that are explored in Chapter 6. This text is intended for a seminar course at the graduate level. It is also meant to serve as a useful resource for researchers and practitioners who are interested in the study of fact-checking, truth discovery, or rumor spreading.
With the emergence of machine learning (ML) techniques in database research, ML has already shown tremendous potential to dramatically impact the foundations, algorithms, and models of several data management tasks, such as error detection, data cleaning, data integration, and query inference. Parts of the data preparation, standardization, and cleaning processes, such as data matching and deduplication for instance, could be automated by making an ML model "learn" and predict the matches routinely. Data integration can also benefit from ML, as the data to be integrated can be sampled and used to design the data integration algorithms. After the initial manual work to set up the labels, ML models can start learning from the new incoming data that is submitted for standardization, integration, and cleaning. The more data supplied to the model, the better the ML algorithm can perform and deliver accurate results. Therefore, ML is more scalable compared to traditional, time-consuming approaches. Nevertheless, many ML algorithms require tuning out of the box, and their parameters and scope are often not adapted to the problem at hand. For example, in cleaning and integration processes, the window sizes of values used for the ML models cannot be arbitrarily chosen and require an adaptation of the learning parameters. This tutorial will survey the recent trend of applying machine learning solutions to improve data management tasks and establish new paradigms to sharpen data error detection, cleaning, and integration at the data instance level, as well as at the schema, system, and user levels.
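The window-size remark above refers to blocking strategies used in matching and deduplication. Below is a small sketch of the classic sorted-neighborhood idea with a tunable window; the records, similarity function, and threshold are purely illustrative and would normally be tuned or learned for the data at hand.

```python
# Sketch of sorted-neighborhood blocking for deduplication with a tunable window size.
from difflib import SequenceMatcher

records = ["John Smith", "Jon Smith", "Jane Doe", "J. Smith", "Janet Doe"]
WINDOW = 3  # candidate pairs are only formed inside this sliding window

def similar(a, b, threshold=0.8):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

sorted_recs = sorted(records)  # sort on a blocking key (here, the raw string)
candidates = []
for i, rec in enumerate(sorted_recs):
    for other in sorted_recs[i + 1:i + WINDOW]:
        if similar(rec, other):
            candidates.append((rec, other))

print(candidates)  # likely duplicate pairs found without comparing all pairs
```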