- research-article, March 2024
Determining the Largest Overlap between Tables
Proceedings of the ACM on Management of Data (PACMMOD), Volume 2, Issue 1, Article No.: 48, Pages 1–26
https://doi.org/10.1145/3639303
Both on the Web and in data lakes, it is possible to detect much redundant data in the form of largely overlapping pairs of tables. In many cases, this overlap is not accidental and provides significant information about the relatedness of the tables. ...
- research-article, March 2024
Discovering Functional Dependencies through Hitting Set Enumeration
Proceedings of the ACM on Management of Data (PACMMOD), Volume 2, Issue 1, Article No.: 43, Pages 1–24
https://doi.org/10.1145/3639298
Functional dependencies (FDs) are among the most important integrity constraints in databases. They serve to normalize datasets and thus resolve redundancies, they contribute to query optimization, and they are frequently used to guide data cleaning ...
- editorial, March 2024
- short-paper, October 2023
MORPHER: Structural Transformation of Ill-formed Rows
CIKM '23: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, October 2023, Pages 5051–5055
https://doi.org/10.1145/3583780.3614747
Open data portals contain a plethora of data files, with comma-separated value (CSV) files being particularly popular with users and businesses due to their flexible standard. However, this flexibility comes with much responsibility for data consumers, ...
BrewER: Entity Resolution On-Demand
Proceedings of the VLDB Endowment (PVLDB), Volume 16, Issue 12, Pages 4026–4029
https://doi.org/10.14778/3611540.3611612
The task of entity resolution (ER) aims to detect multiple records describing the same real-world entity in datasets and to consolidate them into a single consistent record. ER plays a fundamental role in guaranteeing good data quality, e.g., as input ...
- short-paper, June 2023
BCNF* - From Normalized- to Star-Schemas and Back Again
- Marie Fischer,
- Paul Roessler,
- Paul Sieben,
- Janina Adamcic,
- Christoph Kirchherr,
- Tobias Straeubig,
- Youri Kaminsky,
- Felix Naumann
SIGMOD '23: Companion of the 2023 International Conference on Management of Data, June 2023, Pages 103–106
https://doi.org/10.1145/3555041.3589712
Data warehouses are the core of many data analysis processes. They contain various database schemas, which are designed and created through schema transformation and integration. These processes are complex and require technical knowledge, which makes ...
- research-article, May 2023
Discovering Similarity Inclusion Dependencies
Proceedings of the ACM on Management of Data (PACMMOD), Volume 1, Issue 1, Article No.: 75, Pages 1–24
https://doi.org/10.1145/3588929
Inclusion dependencies (INDs) are a well-known type of data dependency, specifying that the values of one column are contained in those of another column. INDs can be used for various purposes, such as foreign-key candidate selection or join partner ...
- research-article, May 2023
Matching Roles from Temporal Data: Why Joe Biden is not only President, but also Commander-in-Chief
- Leon Bornemann,
- Tobias Bleifuß,
- Dmitri V. Kalashnikov,
- Fatemeh Nargesian,
- Felix Naumann,
- Divesh Srivastava
Proceedings of the ACM on Management of Data (PACMMOD), Volume 1, Issue 1, Article No.: 65, Pages 1–26
https://doi.org/10.1145/3588919
We present role matching, a novel, fine-grained integrity constraint on temporal fact data, i.e., (subject, predicate, object, timestamp)-quadruples. A role is a combination of subject and predicate and can be associated with different objects as the ...
Pollock: A Data Loading Benchmark
Proceedings of the VLDB Endowment (PVLDB), Volume 16, Issue 8, Pages 1870–1882
https://doi.org/10.14778/3594512.3594518
Any system at play in a data-driven project has a fundamental requirement: the ability to load data. The de-facto standard format to distribute and consume raw data is csv. Yet, the plain text and flexible nature of this format make such files often ...
Fast Algorithms for Denial Constraint Discovery
Proceedings of the VLDB Endowment (PVLDB), Volume 16, Issue 4, Pages 684–696
https://doi.org/10.14778/3574245.3574254
Denial constraints (DCs) are an integrity constraint formalism widely used to detect inconsistencies in data. Several algorithms have been devised to discover DCs from data, as manually specifying them is burdensome and, worse yet, error-prone. The ...
- keynote, October 2022
Exploring and Analyzing Change: The Janus Project
CIKM '22: Proceedings of the 31st ACM International Conference on Information & Knowledge Management, October 2022, Page 3
https://doi.org/10.1145/3511808.3555799
Data change, all the time. The Janus project seeks to address the Variability dimension of Big Data by modeling, exploring, and analyzing such change, providing valuable insights into the evolving real world and ways in which data about it are collected ...
Frost: a platform for benchmarking and exploring data matching results
- Martin Graf,
- Lukas Laskowski,
- Florian Papsdorf,
- Florian Sold,
- Roland Gremmelspacher,
- Felix Naumann,
- Fabian Panse
Proceedings of the VLDB Endowment (PVLDB), Volume 15, Issue 12, Pages 3292–3305
https://doi.org/10.14778/3554821.3554823
"Bad" data has a direct impact on 88% of companies, with the average company losing 12% of its revenue due to it. Duplicates - multiple but different representations of the same real-world entities - are among the main reasons for poor data quality, so ...
- article, July 2022
Diversity and Inclusion Activities in Database Conferences: A 2021 Report
- Sihem Amer-Yahia,
- Yael Amsterdamer,
- Sourav S. Bhowmick,
- Angela Bonifati,
- Philippe Bonnet,
- Renata Borovica-Gajic,
- Barbara Catania,
- Tania Cerquitelli,
- Silvia Chiusano,
- Panos K. Chrysanthis,
- Carlo Curino,
- Jérôme Darmont,
- Amr El Abbadi,
- Avrilia Floratou,
- Juliana Freire,
- Alekh Jindal,
- Vana Kalogeraki,
- Georgia Koutrika,
- Arun Kumar,
- Sujaya Maiyya,
- Alexandra Meliou,
- Madhulika Mohanty,
- Felix Naumann,
- Nele Sina Noack,
- Fatma Özcan,
- Liat Peterfreund,
- Wenny Rahayu,
- Wang-Chiew Tan,
- Yuanyuan Tian,
- Pinar Tözün,
- Genoveva Vargas-Solar,
- Neeraja Yadwadkar,
- Meihui Zhang
ACM SIGMOD Record (SIGMOD), Volume 51, Issue 2, June 2022, Pages 69–73
https://doi.org/10.1145/3552490.3552510
Diversity and Inclusion (D&I) are core to fostering innovative thinking. Existing theories demonstrate that to facilitate inclusion, multiple types of exclusionary dynamics, such as self-segregation, communication apprehension, and stereotyping and ...
- research-article, June 2022
AI Compliance – Challenges of Bridging Data Science and Law
Journal of Data and Information Quality (JDIQ), Volume 14, Issue 3, Article No.: 21, Pages 1–4
https://doi.org/10.1145/3531532
This vision article outlines the main building blocks of what we term AI Compliance, an effort to bridge two complementary research areas: computer science and the law. Such research has the goal to model, measure, and affect the quality of AI artifacts, ...
- short-paper, June 2022
Mondrian: Spreadsheet Layout Detection
SIGMOD '22: Proceedings of the 2022 International Conference on Management of Data, June 2022, Pages 2361–2364
https://doi.org/10.1145/3514221.3520152
Spreadsheet datasets are valuable sources of data, but often ill-suited for machine consumption. Their unstructured nature allows users to arrange data and metadata freely in a human-readable format, often in canvas-like layouts. To extract their ...
- Article, March 2022
Amending RDF Entities with New Facts
The Semantic Web: ESWC 2014 Satellite Events, Pages 131–143
https://doi.org/10.1007/978-3-319-11955-7_11
Linked and other Open Data poses new challenges and opportunities for the data mining community. Unfortunately, the large volume and great heterogeneity of available open data requires significant integration steps before it can be used in ...
Entity resolution on-demand
Proceedings of the VLDB Endowment (PVLDB), Volume 15, Issue 7, Pages 1506–1518
https://doi.org/10.14778/3523210.3523226
Entity Resolution (ER) aims to identify and merge records that refer to the same real-world entity. ER is typically employed as an expensive cleaning step on the entire data before consuming it. Yet, determining which entities are useful once cleaned ...
- research-article, January 2022
VLDB 2021: Designing a Hybrid Conference
ACM SIGMOD Record (SIGMOD), Volume 50, Issue 4, December 2021, Pages 50–53
https://doi.org/10.1145/3516431.3516447
The 47th International Conference on Very Large Databases (VLDB'21) was held on August 16-20, 2021 as a hybrid conference. It attracted 180 in-person attendees in Copenhagen and 840 remote attendees. In this paper, we describe our key decisions as ...
- research-article, January 2022
How Inclusive are We?
ACM SIGMOD Record (SIGMOD), Volume 50, Issue 4, December 2021, Pages 30–35
https://doi.org/10.1145/3516431.3516438
ACM SIGMOD, VLDB and other database organizations have committed to fostering an inclusive and diverse community, as do many other scientific organizations. Recently, different measures have been taken to advance these goals, especially for ...