FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication

Slyman, Eric; Lee, Stefan; Cohen, Scott; Kafle, Kushal

arXiv:2404.16123 (cs)

[Submitted on 24 Apr 2024]

Abstract: Recent dataset deduplication techniques have demonstrated that content-aware dataset pruning can dramatically reduce the cost of training Vision-Language Pretrained (VLP) models without significant performance losses compared to training on the original dataset. These results have been based on pruning commonly used image-caption datasets collected from the web -- datasets that are known to harbor harmful social biases that may then be codified in trained models. In this work, we evaluate how deduplication affects the prevalence of these biases in the resulting trained models and introduce an easy-to-implement modification to the recent SemDeDup algorithm that can reduce the negative effects that we observe. When examining CLIP-style models trained on deduplicated variants of LAION-400M, we find our proposed FairDeDup algorithm consistently leads to improved fairness metrics over SemDeDup on the FairFace and FACET datasets while maintaining zero-shot performance on CLIP benchmarks.

Comments:	Conference paper at CVPR 2024. 6 pages, 8 figures. Project Page: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
ACM classes:	I.4.10; I.2.7; E.0
Cite as:	arXiv:2404.16123 [cs.CV]
	(or arXiv:2404.16123v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2404.16123

Submission history

From: Eric Slyman
[v1] Wed, 24 Apr 2024 18:28:17 UTC (3,833 KB)
