Fuzzy Overclustering

doi:10.5281/zenodo.5578454

Published October 5, 2021 | Version v3

Dataset Open

This record contains the datasets used for the paper "Fuzzy Overclustering: Semi-supervised classification of fuzzy labels with overclustering and inverse cross-entropy" in the open access journal Sensors (https://doi.org/10.3390/s21196661). The source code is available at https://github.com/Emprime/FuzzyOverclustering and the preprint at https://arxiv.org/abs/2110.06630.

Please cite as

@Article{Schmarje2021foc,
AUTHOR = {Schmarje, Lars and Brünger, Johannes and Santarossa, Monty and Schröder, Simon-Martin and Kiko, Rainer and Koch, Reinhard},
TITLE = {Fuzzy Overclustering: Semi-Supervised Classification of Fuzzy Labels with Overclustering and Inverse Cross-Entropy},
JOURNAL = {Sensors},
VOLUME = {21},
YEAR = {2021},
NUMBER = {19},
ARTICLE-NUMBER = {6661},
URL = {https://www.mdpi.com/1424-8220/21/19/6661; https://doi.org/10.5281/zenodo.5550919},
ISSN = {1424-8220},
DOI = {10.3390/s21196661}
}

We provide the plankton and synthetic datasets used in the paper, together with the following explanations summarized from it. The technical description is given below.

Plankton Dataset

The plankton dataset contains diverse grey-level images of marine planktonic organisms. The images were captured with an Underwater Vision Profiler 5 and are hosted on EcoTaxa (https://ecotaxa.obs-vlfr.fr/). In the citizen science project PlanktonID (https://planktonid.geomar.de/en), each sample was classified multiple times by citizen scientists. We used the data generated by version two of PlanktonID. This version is only available to users who have completed at least 1000 annotations in version one. Therefore, we can assume that the annotators know the differences between the classes and are dedicated to creating consistent results. The second version of PlanktonID is a game in which example images can be sorted into three proposed classes. If none of the proposed classes seems to fit, the user can select any other of the 28 classes (https://planktonid.geomar.de/en/classes) of PlanktonID (including a 'no fitting category' class). The initial proposals are generated by a neural network, so a confirmation bias might be introduced. Some classes of the PlanktonID dataset have very few examples in comparison to others (e.g., as few as three images).
The dataset consists of 12,280 images, originally in 28 classes. We picked the largest classes and merged some smaller classes (e.g. different classes of detritus). All other images are assigned to the class 'no fitting category'. We used 400 training and 200 validation images per class. We selected these images from the available data on whose label all annotators agreed. All other images are used as unlabeled data. If not enough images for training and validation were available, we used random duplicates.
For more details, please see the main paper (https://doi.org/10.3390/s21196661).
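The per-class selection with random duplicates described above can be sketched as follows. This is only an illustration under assumptions: the function name, the use of Python's random module and the exact sampling order are not taken from the paper or its source code.

```python
import random
from typing import List, Sequence


def sample_with_duplicates(pool: Sequence[str], target: int,
                           rng: random.Random) -> List[str]:
    """Return `target` items from `pool`, duplicating at random if needed.

    Hypothetical helper: if the pool is large enough, draw without
    replacement; otherwise keep every item once and fill the remainder
    with randomly chosen duplicates.
    """
    items = list(pool)
    if not items:
        raise ValueError("cannot sample from an empty pool")
    if len(items) >= target:
        return rng.sample(items, target)
    duplicates = rng.choices(items, k=target - len(items))
    return items + duplicates


# Example: a small class with only 10 agreed-upon images is padded to 400.
rng = random.Random(0)
small_class = [f"image_{i}.png" for i in range(10)]
training_images = sample_with_duplicates(small_class, 400, rng)
```

The counts of 400 (training) and 200 (validation) come from the description above; how the authors divided the agreed-upon images between the two splits is not specified here.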

Synthetic Datasets

This dataset is a mixture of circles and ellipses (bubbles) in different colors on a black background. The 6 ground-truth classes are the blue, red and green circles and ellipses. An image is defined as certain if the hue of the color is 0 (red), 120 (green) or 240 (blue) and the main axis ratio of the bubble is 1 (circle) or 2 (ellipse). Every other datapoint is considered fuzzy, and its ground-truth label is an interpolation of the 6 ground-truth classes. The dataset consists of 1800 certain and 1000 fuzzy labeled images for the training, validation and unlabeled data splits.
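The rule for certain images can be written as a small check. The following is a minimal sketch, assuming that hue and axis ratio are available as numbers; the function names and the color and shape strings are hypothetical and not part of the dataset.

```python
import math
from typing import Optional, Tuple

# Values taken from the definition above.
CERTAIN_HUES = {0.0: "red", 120.0: "green", 240.0: "blue"}
CERTAIN_RATIOS = {1.0: "circle", 2.0: "ellipse"}


def certain_class(hue: float, axis_ratio: float,
                  tol: float = 1e-9) -> Optional[Tuple[str, str]]:
    """Return (color, shape) for a certain image, or None if it is fuzzy."""
    color = next((name for value, name in CERTAIN_HUES.items()
                  if math.isclose(hue, value, abs_tol=tol)), None)
    shape = next((name for value, name in CERTAIN_RATIOS.items()
                  if math.isclose(axis_ratio, value, abs_tol=tol)), None)
    if color is None or shape is None:
        return None
    return color, shape


def is_certain(hue: float, axis_ratio: float) -> bool:
    """True if both hue and axis ratio match one of the certain values."""
    return certain_class(hue, axis_ratio) is not None
```

For example, is_certain(120, 2) holds for a green ellipse, while a hue of 60 or an axis ratio of 1.5 makes the image fuzzy. How the fuzzy label is interpolated is described in the main paper and is not reproduced here.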
We defined three subsets: Ideal, Real and Fuzzy. The Ideal subset uses the majority ground-truth label for every image, which is not available in reality. For the Real subset, the ground-truth class is randomly picked from the ground-truth label distribution and represents a noisy / fuzzy annotation. The Fuzzy subset uses only certain labeled images as training data and represents a cleaned training dataset.
For more details, please see the main paper (https://doi.org/10.3390/s21196661).
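The difference between the Ideal and Real labels can be illustrated with a short sketch. It assumes that the ground-truth label distribution of an image is available as a mapping from class name to probability; this representation, the function names and the class names in the example are hypothetical.

```python
import random
from typing import Mapping


def ideal_label(distribution: Mapping[str, float]) -> str:
    """Ideal subset: take the majority (most probable) class."""
    return max(distribution, key=lambda name: distribution[name])


def real_label(distribution: Mapping[str, float], rng: random.Random) -> str:
    """Real subset: draw one class at random according to the distribution."""
    classes = list(distribution)
    weights = [distribution[name] for name in classes]
    return rng.choices(classes, weights=weights, k=1)[0]


# Hypothetical fuzzy image between a red circle and a red ellipse.
rng = random.Random(0)
fuzzy = {"red_circle": 0.7, "red_ellipse": 0.3}
majority = ideal_label(fuzzy)    # always "red_circle"
sampled = real_label(fuzzy, rng)  # "red_circle" or "red_ellipse"
```

Repeated sampling from the same distribution can return different classes, which is what makes the Real annotation noisy.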

Technical description

Each folder represents one dataset. The subfolders train, val and unlabeled represent the data splits Training, Validation and Unlabeled, respectively. The ground-truth label used for each image is given by the name of the folder containing it. Each image is one datapoint.
The filename for the plankton data is structured as <CLASS>-<ECOTAXA_ID>.png
The filename for the synthetic data is structured as <SAMPLECOUNTER>-<HUEVALUE>-<AXIS_RATIO>.png
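A minimal sketch of how this layout and the two filename patterns could be read is given below. The function names are hypothetical, the directory nesting <dataset>/<split>/<label>/<image>.png is an assumption based on the description above, and the numeric formats of <HUEVALUE> and <AXIS_RATIO> are assumed to be parseable as floats.

```python
from pathlib import Path
from typing import Iterator, NamedTuple, Tuple


class SyntheticName(NamedTuple):
    sample_counter: str  # kept as text; its exact format is not documented
    hue: float
    axis_ratio: float


def parse_plankton_name(filename: str) -> Tuple[str, str]:
    """Split '<CLASS>-<ECOTAXA_ID>.png' into (class name, EcoTaxa id)."""
    stem = Path(filename).stem
    # Split at the last hyphen in case the class name itself contains one.
    class_name, ecotaxa_id = stem.rsplit("-", 1)
    return class_name, ecotaxa_id


def parse_synthetic_name(filename: str) -> SyntheticName:
    """Split '<SAMPLECOUNTER>-<HUEVALUE>-<AXIS_RATIO>.png' into its parts."""
    stem = Path(filename).stem
    counter, hue, axis_ratio = stem.split("-", 2)
    return SyntheticName(counter, float(hue), float(axis_ratio))


def iter_labeled_images(dataset_dir: str, split: str) -> Iterator[Tuple[Path, str]]:
    """Yield (image path, label), where the label is the parent folder name."""
    split_dir = Path(dataset_dir) / split
    for image_path in sorted(split_dir.glob("*/*.png")):
        yield image_path, image_path.parent.name
```

For instance, list(iter_labeled_images("some_dataset", "train")) would collect the training images of a dataset folder together with their folder-name labels, where "some_dataset" stands for one of the extracted folders.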

Files

Files (97.8 MB)

Name	Size	MD5
Fuzzy Overclustering Datasets.zip	97.8 MB	f358b80603fd0ad6648fa776dd478345
syn-11.png	4.2 kB	2181481fccbeb3cc1518ec9c73f1d322
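The MD5 checksums listed for the files can be used to check a download. The following is a minimal sketch using Python's standard hashlib module; the function names are hypothetical, and the files are assumed to have been saved under their original names.

```python
import hashlib
from pathlib import Path

# Checksums as listed for this record.
EXPECTED_MD5 = {
    "Fuzzy Overclustering Datasets.zip": "f358b80603fd0ad6648fa776dd478345",
    "syn-11.png": "2181481fccbeb3cc1518ec9c73f1d322",
}


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path: str) -> bool:
    """Compare a downloaded file against the checksum listed for its name."""
    expected = EXPECTED_MD5[Path(path).name]
    return md5_of_file(path) == expected
```

Calling matches_listed_md5("Fuzzy Overclustering Datasets.zip") in the download directory should return True for an intact archive.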

