Croissant: A Metadata Format for ML-Ready Datasets
- Mubashara Akhtar,
- Omar Benjelloun,
- Costanza Conforti,
- Pieter Gijsbers,
- Joan Giner-Miguelez,
- Nitisha Jain,
- Michael Kuchnik,
- Quentin Lhoest,
- Pierre Marcenac,
- Manil Maskey,
- Peter Mattson,
- Luis Oala,
- Pierre Ruyssen,
- Rajat Shinde,
- Elena Simperl,
- Goeffry Thomas,
- Slava Tykhonov,
- Joaquin Vanschoren,
- Jos van der Velde,
- Steffen Vogler,
- Carole-Jean Wu
Data is a critical resource for Machine Learning (ML), yet working with data remains a key friction point. This paper introduces Croissant, a metadata format for datasets that simplifies how data is used by ML tools and frameworks. Croissant makes ...
Towards Interactively Improving ML Data Preparation Code via "Shadow Pipelines"
Data scientists develop ML pipelines in an iterative manner: they repeatedly screen a pipeline for potential issues, debug it, and then revise and improve its code according to their findings. However, this manual process is tedious and error-prone. ...
tailwiz: Empowering Domain Experts with Easy-to-Use, Task-Specific Natural Language Processing Models
Experts outside the field of machine learning (ML) are interested in using ML techniques to analyze their textual data, but they are inhibited by a lack of convenient natural language processing (NLP) tools. To address this issue, we present tailwiz, an ...
AIDB: a Sparsely Materialized Database for Queries using Machine Learning
Analysts and scientists are interested in automatically analyzing the semantic contents of unstructured, non-tabular data (videos, images, text, and audio). These analysts have turned to unstructured data systems leveraging machine learning (ML). The ...
Reaching the Edge of the Edge: Image Analysis in Space
Satellites have become more widely available due to the reduction in size and cost of their components. As a result, there has been an advent of smaller organizations having the ability to deploy satellites with a variety of data-intensive applications ...
Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly
With the emergence of AI regulations, such as the EU AI Act, requirements for simple data lineage, enforcement of low data bias, and energy efficiency have become a priority for everyone offering AI services. Being pre-trained on versatile and a vast ...
Reactive Dataflow for Inflight Error Handling in ML Workflows
- Abhilash Jindal,
- Kaustubh Beedkar,
- Vishal Singh,
- J. Nausheen Mohammed,
- Tushar Singla,
- Aman Gupta,
- Keerti Choudhary
Modern data analytics pipelines comprise traditional data transformation operations and pre-trained ML models deployed as user-defined functions (UDFs). Such pipelines, which we call ML workflows, generally produce erroneous results due to data errors ...
Towards Efficient Data Wrangling with LLMs using Code Generation
While LLM-based data wrangling approaches that process each row of data have shown promising benchmark results, computational costs still limit their suitability for real-world use cases on large datasets. We revisit code generation using LLMs for ...
Reproducible data science over data lakes: replayable data pipelines with Bauplan and Nessie
As the Lakehouse architecture becomes more widespread, ensuring the reproducibility of data workloads over data lakes emerges as a crucial concern for data practitioners. However, achieving reproducibility remains challenging. The size of data pipelines ...
Nautilus: A Benchmarking Platform for DBMS Knob Tuning
Recent research has shown the importance of tuning DBMS configuration knobs to achieve high performance. As a result, a large number of search-based and learning-based auto-tuning methods have been proposed. However, despite the promising results, we ...
DLProv: A Data-Centric Support for Deep Learning Workflow Analyses
The Deep Learning (DL) workflow involves several steps of data transformation. Evaluating various configurations at each step of the workflow may be a complex task when it comes to selecting DL models. This decision-making process requires basing ...