Automatic Labels are as Effective as Manual Labels in Biomedical Images Classification with Deep Learning

Niccolò Marini¹ Stefano Marchesin² Lluis Borras Ferris¹ Simon Püttmann³ Marek Wodzinski^1,4 Riccardo Fratti¹ Damian Podareanu⁵ Alessandro Caputo^6,7 Svetla Boytcheva^8,9 Simona Vatrano⁷ Filippo Fraggetta⁷ Iris Nagtegaal¹⁰ Gianmaria Silvello² Manfredo Atzori^1,11 Henning Müller^1,12

Abstract

Background: The increasing availability of biomedical data is helping to design more robust deep learning (DL) algorithms to analyze biomedical samples. Currently, one of the main limitations in training DL algorithms to perform a specific task is the need for medical experts to label data. Automatic methods to label data exist; however, automatic labels can be noisy, and it is not completely clear when they can be adopted to train DL models.

Method: This paper aims to investigate under which circumstances automatic labels can be adopted to train a DL model on the classification of Whole Slide Images (WSI). The analysis involves multiple architectures, such as Convolutional Neural Networks (CNN) and Vision Transformers (ViT), and over 10’000 WSIs, collected from three use cases: celiac disease, lung cancer and colon cancer, which include, respectively, binary, multiclass and multilabel data.

Results: The results identify 10% as the percentage of noisy labels up to which competitive models for the classification of WSIs can still be trained. Therefore, an algorithm generating automatic labels needs to fit this criterion to be adopted. The application of the Semantic Knowledge Extractor Tool (SKET) algorithm to generate automatic labels leads to performance comparable to the one obtained with manual labels, since it generates a percentage of noisy labels between 2% and 5%.

Conclusions: Automatic labels are as effective as manual ones, reaching solid performance comparable to the one obtained training models with manual labels.

¹Information Systems Institute, University of Applied Sciences Western Switzerland (HES-SO Valais), Sierre, Switzerland
²Department of Information Engineering, University of Padua, Padua, Italy
³University of Applied Sciences and Arts Dortmund, Dortmund, Germany
⁴Department of Measurement and Electronics, AGH University of Kraków, Krakow, Poland
⁵SURFsara, Amsterdam, The Netherlands
⁶Department of Pathology, Ruggi University Hospital, Salerno, Italy
⁷Pathology Unit, Gravina Hospital Caltagirone ASP, Catania, Italy
⁸Ontotext, Sofia, Bulgaria
⁹Institute of Information and Communication Technologies, Bulgarian Academy of Sciences, Sofia, Bulgaria
¹⁰Department of Pathology, Radboud University Medical Center, Nijmegen, The Netherlands
¹¹Department of Neurosciences, University of Padua, Padua, Italy
¹²Medical faculty, University of Geneva, 1211 Geneva, Switzerland
^∗Both authors contributed equally to this work.

Keywords Automatic Weak Labels $\cdot$ Deep Learning $\cdot$ Histopathology Image Classification $\cdot$ Noisy Labels

1 Introduction

1.1 Background

The development of deep learning (DL) algorithms is fostering the design of new tools that can be trained on clinical data without the need for human intervention, especially in domains where the cost of annotations is high, such as histopathology. Histopathology is the gold standard to diagnose cancer (Van der Laak et al., 2021; De Matos et al., 2021). The domain involves the analysis of small tissue slices to identify microscopic findings related to dangerous diseases (Gurcan et al., 2009), such as cancer. Tissue slices undergo microscopic examination by a medical expert, the pathologist, who usually needs up to an hour to analyze a single sample (Krupinski et al., 2013). Despite the increasing digitization of tissue samples, histopathological samples are still rarely analyzed with digital aids in clinical practice (Fraggetta et al., 2017, 2021). Digital pathology is a domain involving the digitization and management of tissue specimens; the digitized specimens are called Whole Slide Images (WSIs). WSIs are high-resolution images stored in a pyramidal format, to capture details at different magnification levels (Merchant and Castleman, 2022). Usually, the highest resolution levels have a spatial resolution of 0.25–0.5 $\mu$m per pixel, corresponding to magnifications of 40x and 20x, respectively. WSIs are usually coupled with pathology reports. Pathology reports are semi-structured free-text documents containing information about the patient anamnesis, the tissue specimen type and the findings and observations identified by a pathologist during the tissue examination (Hewer, 2020; Hanna et al., 2020). WSIs and reports are usually stored in the Laboratory Information System (LIS), easily enabling sample retrieval. The increasing collection of biomedical samples is encouraging the design of automatic tools to analyze WSIs under the computational pathology domain (Van der Laak et al., 2021; Madabhushi and Lee, 2016; Litjens et al., 2022). Most computational pathology algorithms are currently based on deep learning, such as CNNs (Convolutional Neural Networks) or ViTs (Vision Transformers) (Xu et al., 2023; Cifci et al., 2023).

Even if computational pathology algorithms show accurate and robust performance in tasks such as WSI classification or segmentation, several challenges are still open, such as data labels (Madabhushi and Lee, 2016; Campanella et al., 2019; Van der Laak et al., 2021; Abels et al., 2019; Chen et al., 2022). Data labels are required to train supervised learning algorithms. However, the collection of labels is not trivial, for both strong and weak annotations. Even if strong labels (i.e. pixel-wise annotations) usually achieve the most accurate performance when used to train a deep learning model, they require a pathologist to analyze samples, which is time-consuming and therefore often unfeasible (Karimi et al., 2020). Therefore, research on the analysis of WSIs mostly exploits weak (i.e. image-level) labels. Weak labels are related to the global image, even if they originate from a region of the image including specific characteristics, such as cancer (Deng et al., 2020). Weak labels are inherently noisier than pixel-wise annotations, since the regions leading to a specific label may be a small percentage of the whole image (e.g. 1-2%). For this reason, algorithms based on weak labels require larger training datasets to reach accurate performance. Currently, most weakly-supervised algorithms in computational pathology are based on the Multiple Instance Learning (MIL) framework (Carbonneau et al., 2018), which models the whole image as a bag of instances, where only the annotations about the global image are available. The MIL framework includes several algorithms, which lately showed high performance when adopted on large-scale datasets (Campanella et al., 2019; Ilse et al., 2018; Wang et al., 2019; Lu et al., 2021; Hashimoto et al., 2020; Chen et al., 2022). For example, Campanella et al. (2019) showed that it is possible to reach almost perfect predictions on binary classification (cancer vs. non-cancer) using around 10’000 weakly-annotated WSIs, on three use cases: skin, breast, and prostate images. The production of weak labels is faster than that of strong ones, since they can be extracted from reports. For example, the analysis of a report may require approximately 30 seconds to 1 minute, in comparison with the analysis of an image, which requires around an hour. However, human intervention is usually still required to analyze reports, unless the Laboratory Information System (LIS), where the samples and corresponding reports are stored, has a specific structure to automatically retrieve data according to the characteristics that can be used as labels. Unfortunately, most LISs do not offer this feature, since they are organized in heterogeneous ways.

Automatic methods for extracting concepts from reports and using them as weak labels already exist (Marini et al., 2022), but the noisy nature of weak labels can make automatic labeling ineffective. This paper aims to investigate under which circumstances automatic labels (i.e. labels automatically generated by an algorithm) can be adopted to train deep learning models, to alleviate the need for experts to annotate data. In particular, the goal is to identify when the results achieved using this type of labels are comparable to the ones obtained using manual labels (i.e. labels produced by a medical expert), so that data included in LISs can be fully exploited to build more robust and accurate tools to diagnose diseases. The characteristics investigated in the paper are the percentage of wrong automatic labels up to which performance remains comparable to the one obtained with manual labels, the nature of labels (i.e. binary, multiclass and multilabel) and the deep learning architecture (more or less robust to noise). Wrong automatic labels are annotations that are automatically produced by an algorithm and do not match the ground truth (i.e. the manually-made labels).

1.2 Contribution

The paper includes a comparison of deep learning architectures trained with automatic and manual labels on the classification of WSIs. The comparison involves two sets of experiments: a controlled scenario and a real-case scenario. In the controlled scenario, manual labels are randomly perturbed with different percentages of noise, simulating the output of an algorithm that generates automatic labels. The random perturbation modifies the labels as follows: in the celiac disease use case (binary), labels are flipped; in the lung cancer use case (multiclass), a different class is assigned to a sample; in the colon cancer use case (multilabel), labels are modified so that one or more classes from the original label are flipped. In the real-case scenario, the Semantic Knowledge Extractor Tool (SKET) (Marchesin et al., 2022) is used to extract meaningful concepts from reports that are used as weak labels for the corresponding samples.
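The controlled perturbation can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the integer/0-1 label encoding, the rounding of the number of perturbed samples and the uniform choice of how many classes to flip in the multilabel case are all assumptions made for illustration.

```python
import random


def perturb_binary(labels, noise_rate, rng):
    """Flip a fraction `noise_rate` of binary labels (0 <-> 1)."""
    noisy = list(labels)
    n_noisy = round(noise_rate * len(noisy))
    for i in rng.sample(range(len(noisy)), n_noisy):
        noisy[i] = 1 - noisy[i]
    return noisy


def perturb_multiclass(labels, noise_rate, n_classes, rng):
    """Assign a different class to a fraction `noise_rate` of the samples."""
    noisy = list(labels)
    n_noisy = round(noise_rate * len(noisy))
    for i in rng.sample(range(len(noisy)), n_noisy):
        other_classes = [c for c in range(n_classes) if c != noisy[i]]
        noisy[i] = rng.choice(other_classes)
    return noisy


def perturb_multilabel(labels, noise_rate, rng):
    """Flip one or more classes in a fraction `noise_rate` of the samples.

    Each label is a 0/1 vector; the number of flipped classes is drawn
    uniformly (an assumption, the paper only states "one or more").
    """
    noisy = [list(vector) for vector in labels]
    n_noisy = round(noise_rate * len(noisy))
    for i in rng.sample(range(len(noisy)), n_noisy):
        n_flips = rng.randint(1, len(noisy[i]))
        for j in rng.sample(range(len(noisy[i])), n_flips):
            noisy[i][j] = 1 - noisy[i][j]
    return noisy


rng = random.Random(0)
manual = [0, 1] * 50                      # 100 toy binary labels
noisy = perturb_binary(manual, 0.10, rng)  # 10% of the labels are flipped
```

Sampling indices without replacement guarantees that exactly the requested fraction of samples differs from the manual labels, which keeps the nominal noise percentage and the effective one aligned.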

The analysis involves three tissue use cases, celiac disease, lung cancer and colon cancer, composing a training dataset with over 10’000 WSIs, used to train three deep learning architectures: CLAM (Lu et al., 2021), transMIL (Shao et al., 2021) and Vision Transformer (ViT) (Chen et al., 2022).

Figure 1: Overview of the tissue use cases analyzed in the paper. The upper line includes examples of duodenal tissue samples, related to celiac disease. The central line includes examples of lung tissue samples. The bottom line includes examples of colon samples.

Celiac disease (CD) is an autoimmune disorder leading to inflammation and damage in the small intestine, resulting in a range of gastrointestinal and systemic symptoms (Caio et al., 2019). Globally, celiac disease affects about 1-2% of the population (Lebwohl and Rubio-Tapia, 2021), with variations across regions. Celiac disease diagnosis involves duodenal biopsies and serological tests (for specific antibodies). In particular, the examination of biopsies aims to identify villous atrophy, crypt hyperplasia and increased intraepithelial lymphocytes. In this paper, duodenal samples are labeled with celiac disease or normal tissue (binary labels).

Lung cancer is the leading cause of death related to cancer worldwide (Schabath and Cote, 2019; Organization, 2023). It is categorized into two main groups: Non-Small Cell Lung Cancer (NSCLC), which represents the large majority of cases (about 85%), and Small-Cell Lung Cancer (SCLC), which is less common, but more aggressive. NSCLC is further divided into subtypes, such as LUng ADenocarcinoma (LUAD) and LUng Squamous cell Carcinoma (LUSC). Diagnosis of lung cancer through biopsies often involves the identification of irregular cell patterns, architectural distortion, and increased cellular density (Travis, 2011). In this paper, lung samples are labeled with SCLC, LUAD, LUSC or normal tissue (multiclass labels).

Colon cancer is the fourth most often diagnosed cancer worldwide (Benson et al., 2018). Colon cancer diagnosis involves the identification of multiple concepts, such as the presence of cancer and the evaluation of polyp shapes and possible abnormalities leading to dysplasia. In this paper, colon samples are labeled with colon cancer, high-grade dysplasia (HGD), low-grade dysplasia (LGD), hyperplastic polyp and normal tissue (multilabel labels). Figure 1 shows some histopathological samples corresponding to the three tissues.

2 Materials and Methods

2.1 Dataset composition

The dataset used in this paper includes WSIs and reports (paired together) of celiac disease, lung cancer and colon cancer, collected from two hospitals: the Catania cohort and Radboudumc (RUMC).

WSIs are used to train and test different computer vision architectures on the image-level classification. WSIs are gigapixel tissue samples that can exhibit significant heterogeneity, for example in terms of staining (Marini et al., 2021a, 2023) and sample types. The image heterogeneity is a consequence of different acquisition procedures across laboratories, related to the chemical reagents applied to the specimen and to the whole slide scanners. One of the main consequences of the heterogeneity is the stain variability, leading to different color variations, intensity, and uniformity of stains across different slides (as shown in Figure 1). Also, the WSIs collected in this dataset show the same characteristic, aiming to replicate a common scenario in digital pathology. WSIs collected from the Catania cohort were scanned with two 3DHistech scanners and two Aperio scanners and stored with a magnification of 20-40x; WSIs collected from RUMC were scanned using 3DHistech scanners, mainly stored at 40x magnification.

Reports are used to extract meaningful concepts, which are used as weak automatic labels to train the model to classify WSIs. Reports include free-text descriptions summarizing the findings from tissue examination. The findings are reported in a field named ‘Conclusion’, containing either macroscopic or microscopic observations. Even if a report includes many fields, only the findings are relevant for the analysis proposed in the paper. Therefore, additional patient information, such as family history or personal data, is discarded. Textual reports show heterogeneity, mainly related to the source language and the textual content. Reports are collected from an Italian and a Dutch hospital, therefore they have to be translated into English to standardize the analysis. The textual content slightly differs across sources, because the Catania cohort reports contain a field specifically for the findings identified in a single slide, while the RUMC reports include a specific field for the findings identified in a tissue block, which may encompass multiple slides. Furthermore, samples are collected across years and are produced by many different pathologists, each with their own writing style.

The dataset includes samples collected from three different use cases: celiac disease, lung cancer, colon cancer. Data are randomly selected from LISs, to simulate a real-case scenario. The goal is to show that the approach can generalize to different types of tissue (both in terms of images and reports). Furthermore, different types of labels are used: celiac disease samples are annotated with binary labels, lung samples with multiclass labels, and colon samples with multilabel annotations.

Table 1: Composition of the samples related to the celiac disease use case. Data are labeled with binary labels: celiac disease and normal tissue. The dataset is split into training and testing partitions. The model is trained and validated adopting a 10-fold cross-validation approach.

Source	Celiac Disease	Normal Tissue	Total
Training dataset: Automatic Labels
Catania	47	711	758
RUMC	217	524	741
Total	264	1235	1499
Training dataset: Manual Labels
Catania	61	697	758
RUMC	223	518	741
Total	284	1215	1499
Testing dataset
Catania	10	83	93
RUMC	37	63	100
Total	47	146	193

Table 1 includes a detailed composition of data related to celiac disease collected from pathology reports, split into training and testing partitions. Data are labeled with binary labels: celiac disease and normal tissue.

Table 2: Composition of the samples related to the lung cancer use case. Data are labeled with multiclass labels: Small-Cell Cancer, Non-Small Adenocarcinoma Cell Cancer, Non-Small Squamous Cell Cancer, Normal Tissue. The dataset is split into training and testing partitions. The model is trained and validated adopting a 10-fold cross-validation approach.

Source	SCLC	LUAD	LUSC	Normal	Total
Training dataset: Automatic Labels
Catania	49	526	250	226	1051
RUMC	1	262	195	1041	1499
Total	50	788	445	1267	2550
Training dataset: Manual Labels
Catania	50	519	271	211	1051
RUMC	1	260	173	1065	1499
Total	51	779	444	1276	2550
Testing dataset
Catania	12	62	67	32	173
RUMC	0	55	29	110	194
Total	12	117	96	142	367

Table 2 includes a detailed composition of data related to lung cancer collected from pathology reports, split in training and testing partitions. Data are labeled with multiclass labels: Small-Cell Cancer, Non-Small Adenocarcinoma Cell Cancer, Non-Small Squamous Cell Cancer, Normal Tissue.

Table 3: Composition of the samples related to the colon cancer use case. Data are labeled with multilabel annotations: Adenocarcinoma, High-Grade Dysplasia (HGD), Low-Grade Dysplasia (LGD), Hyperplastic Polyp, Normal Tissue. Due to the multilabel nature of labels, the total samples for each class may not correspond to the total number of samples. The dataset is split into training and testing partitions. The model is trained and validated adopting a 10-fold cross-validation approach.

Source	Adenocarcinoma	HGD	LGD	Hyperplastic	Normal	Total
Training dataset: Automatic Labels
Catania	776	761	1288	511	596	3095
RUMC	383	377	853	943	1341	3460
Total	1159	1138	2141	1454	1937	6555
Training dataset: Manual Labels
Catania	865	774	1273	535	570	3095
RUMC	394	362	878	965	1309	3460
Total	1259	1136	2151	1500	1879	6555
Testing dataset
Catania	111	96	113	32	98	348
RUMC	75	65	146	119	193	520
Total	186	161	259	151	291	868

Table 3 includes a detailed composition of data related to colon cancer collected from pathology reports, split into training and testing partitions. Data are labeled with multilabel labels: Adenocarcinoma, High-Grade Dysplasia (HGD), Low-Grade Dysplasia (LGD), Hyperplastic Polyp, Normal Tissue.

2.2 Data analysis pipeline

The training schema is based on computer vision algorithms to classify WSIs, comparing the performance of automatic and manual labels during the training. Those algorithms are based on weak labels, since they are easier to collect, even if they still require the intervention of medical experts. In this paper, three different MIL backbones are adopted: two CNN-based ones (CLAM and transMIL) and a ViT. The architectures are trained to evaluate the effect that automatic labels may have on the training of models to classify WSIs. Firstly, they are trained with noisy labels, randomly generated by perturbing the manual labels with different percentages of noise (1, 2, 5, 10, 20 and 50%). The goal of this experiment is to evaluate the effect that noisy labels have on the performance of a model. However, this setup does not fit a real-case scenario where automatic labels are adopted. Noisy labels may be considered as wrongly-labeled samples, but not all labeling mistakes are equally likely. Consider, for example, weak labels inferred from reports: some reports, due to their content, may be more easily mislabeled. For this reason, a second setup is proposed, adopting a real tool to extract concepts from reports: the Semantic Knowledge Extractor Tool (SKET) (Marchesin et al., 2022). This second experiment also helps to test the rules identified within the first experiment.

Figure 2 shows an overview of the data analysis pipeline.

2.3 Computer vision architectures

Three computer vision algorithms to classify WSIs are compared in the paper as backbones, to evaluate the effect that noisy labels may have on different architectures, including two CNNs and a ViT. The CNNs have a ResNet34 backbone, while the ViT has a backbone similar to the one shown in Chen et al. (2022), considering a single magnification level. In both cases, the backbones are designed to output an embedding of size 128 representing a single WSI, so that the same classifier can be adopted for all architectures, modifying the output classes based on the use case.

CLAM

Clustering-constrained Attention Multiple Instance Learning (CLAM) (Lu et al., 2021) is a MIL framework based on an attention-based network, whose goal is to highlight relevant regions inside the WSI, to improve the WSI-level prediction. CLAM applies a mechanism to the single instances to aggregate them into clusters, according to the instance similarity, to enrich the WSI representation and improve WSI-level predictions. CLAM can have one or more attention branches, depending on the number of classes. In this paper, a single attention branch (CLAM_SB) is used on celiac disease (binary labels), while multiple attention branches (CLAM_MB) are used on the other two use cases.
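The role of the attention branch can be sketched with a simplified attention-based MIL pooling, in the spirit of Ilse et al. (2018). The sketch below is not CLAM itself: it omits the gating and the instance-level clustering, and the parameter names `V` and `w` as well as the toy values are assumptions for illustration.

```python
import math


def attention_pool(instances, V, w):
    """Aggregate instance embeddings into one bag (WSI) embedding.

    instances: list of patch embeddings (lists of floats, same length).
    V: projection matrix (list of rows) applied before the tanh.
    w: vector turning each projected instance into a scalar score.
    Returns the attention-weighted bag embedding and the attention weights.
    """
    scores = []
    for h in instances:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(a * b for a, b in zip(w, hidden)))

    # Softmax over the instances (shifted by the max for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(instances[0])
    bag = [sum(a * h[d] for a, h in zip(weights, instances)) for d in range(dim)]
    return bag, weights


# Toy bag with three 2-dimensional patch embeddings.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
w = [1.0, 1.0]
bag_embedding, attention = attention_pool(patches, V, w)
```

The attention weights sum to one, so they can be read as the relative contribution of each patch to the WSI-level embedding that feeds the classifier.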

transMIL

transMIL (Shao et al., 2021) is a MIL framework developed to exploit the morphological and spatial characteristics of WSIs. Even if morphological and spatial characteristics of images are important, the attention mechanism does not take them into account when evaluating input instances. transMIL exploits Transformer architectures (Vaswani et al., 2017) to highlight relationships between single instances, modeling input instances as a sequence of tokens and evaluating the similarity among instances.

Vision Transformer

Vision Transformer (Sharir et al., 2021; Han et al., 2020) is a deep learning architecture adopted to analyze images, adopting the self-attention mechanism to process input data instead of convolutional layers, showing more competitive performance in terms of accuracy and efficiency. The architecture processes input data as a sequence of input tokens, that are small sub-regions of the input image (usually 16x16 pixels). The architecture includes 12 encoder layers producing the embedding to feed the classifier.

2.4 Semantic Knowledge Extractor Tool (SKET)

SKET (Marchesin et al., 2022) is an unsupervised algorithm combining a rule-based expert system with machine learning models, chosen to extract meaningful concepts from reports and use them as weak labels for WSIs (Marchesin et al., 2022; Menotti et al., 2023). The algorithm includes three components: Named Entity Recognition, Entity Linking and Data Labeling. Named Entity Recognition involves pre-trained models (ScispaCy models (Neumann et al., 2019)), developed to work on biomedical data, and large Word2Vec word vectors (Mikolov et al., 2013) trained on the PubMed Central Open Access Subset (Mikolov et al., 2013). Entity Linking involves a combination of similarity matching techniques to match ad-hoc concepts to a reference ontology. Data Labeling involves the mapping of the concepts to a set of annotation classes. Since SKET is unsupervised, no training data are required to tune it. This feature is relevant because, unlike other Natural Language Processing (NLP) algorithms, SKET does not require annotated data for training.

3 Experimental Setup

3.1 Image pre-processing

Image pre-processing includes the WSI splitting into patches. WSIs usually do not fit in modern GPU memory because of their gigapixel size, therefore they have to be split into patches. In this paper, WSIs are split into 224x224 pixel patches using the Multi_Scale_Tools library (Marini et al., 2021b). The choice of the size is related to the characteristics of the ResNet34 backbone, which requires a fixed input size. Patches are extracted at magnification 5x for celiac samples, while lung and colon patches are sampled at magnification 10x. The choice of the magnification is driven by the morphological features that are useful for the classification task: celiac disease diagnosis requires identifying the villous shape and the crypts, therefore 5x magnification is chosen; on the other hand, lung and colon require a more refined level of magnification, because the shape of glands is as relevant as the cell infiltration, therefore 10x is chosen. Not all sampled patches are selected: the ones from background regions are discarded, since they are not informative. The identification of background regions involves the application of the HistoQC tool (Janowczyk et al., 2019), which generates tissue masks.
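The patch selection step can be sketched as follows. This is an illustrative sketch, not the Multi_Scale_Tools or HistoQC API: the non-overlapping grid, the in-memory 0/1 tissue mask at the same resolution as the patches, and the `min_tissue` threshold of 0.5 are assumptions, since the paper does not report these details.

```python
PATCH_SIZE = 224  # patch side in pixels, as used in the paper


def tissue_fraction(mask, x0, y0, size):
    """Fraction of tissue pixels (value 1) in the square patch at (x0, y0)."""
    rows = mask[y0:y0 + size]
    tissue_pixels = sum(sum(row[x0:x0 + size]) for row in rows)
    return tissue_pixels / (size * size)


def select_patches(mask, size=PATCH_SIZE, min_tissue=0.5):
    """Return top-left coordinates of grid patches with enough tissue.

    mask: 2D list of 0/1 values (1 = tissue, 0 = background).
    min_tissue: hypothetical threshold on the tissue fraction.
    """
    height, width = len(mask), len(mask[0])
    coordinates = []
    for y0 in range(0, height - size + 1, size):
        for x0 in range(0, width - size + 1, size):
            if tissue_fraction(mask, x0, y0, size) >= min_tissue:
                coordinates.append((x0, y0))
    return coordinates


# Toy 448x448 mask: the left half is tissue, the right half is background.
toy_mask = [[1 if x < 224 else 0 for x in range(448)] for _ in range(448)]
kept = select_patches(toy_mask)
```

On the toy mask only the two patches in the left column are kept, while the two background patches are discarded.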

3.2 Report pre-processing

The report pre-processing only involves their translation into English. Original reports are stored in Italian and Dutch, depending on the workflow from which they are collected. The translation is necessary because state-of-the-art NLP algorithms are mostly developed to work with inputs in English. MarianMT neural machine translation models (Junczys-Dowmunt et al., 2018) are used to translate the content of the reports to English.

3.3 Architecture pre-training

The backbones of deep learning algorithms to analyze images are pre-trained using self-supervised algorithms: simCLR (Chen et al., 2020) for the CNNs (CLAM and transMIL), DINO v2 (Oquab et al., 2023) for the ViT.

Both algorithms are adopted to learn meaningful features from unannotated input data, exploiting similarities and dissimilarities between input samples. In this paper, the input data for the algorithms are the patches sampled from the training partition. Since data are unannotated, no information is available regarding similarity among patches. Therefore, data augmentation is adopted: samples are similar to their augmented versions and dissimilar from the other samples within a batch. The algorithms differ in the data augmentation strategy. simCLR is designed for CNNs and its augmentation pipeline includes several operations, applied with a probability of 0.5: random rotations (90/180/270 degrees), vertical/horizontal flipping, hue-saturation-contrast (HUE) color augmentation, RGB shift, color jitter, gaussian noise, elastic transformation, grid distortions. DINO is designed for ViT and involves a knowledge distillation mechanism: two networks are involved in the training, a teacher and a student. The teacher is a larger model producing outputs that the student aims to mimic and replicate. Both models are directly trained with two different augmented versions of input samples. However, the student is also trained with a cropped version (96x96 pixels) of the teacher inputs. The DINO v2 setup includes two augmentation pipelines: the first one includes color jitter, horizontal/vertical flipping, gaussian blur; the second one includes color jitter, horizontal/vertical flipping, gaussian blur, solarization.

3.4 Image data augmentation pipeline

The Albumentations library (Buslaev et al., 2020) is adopted to apply data augmentation to input images. The operations involved are random rotations (90/180/270 degrees), vertical/horizontal flipping and hue-saturation-contrast (HUE) color augmentation. Each operation of the data augmentation pipeline is selected with a probability of 0.5 and applied at image-level, so that all the patches of a WSI are augmented consistently.
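The image-level consistency can be sketched in plain Python: the augmentation parameters are sampled once per WSI and then replayed on every patch. The sketch does not use the Albumentations API; patches are toy nested lists, the color augmentation is omitted, and all function names are hypothetical.

```python
import random


def rot90(patch, k):
    """Rotate a 2D patch clockwise by k * 90 degrees."""
    for _ in range(k % 4):
        patch = [list(row) for row in zip(*patch[::-1])]
    return patch


def sample_params(rng, p=0.5):
    """Draw the augmentation parameters once for a whole WSI."""
    return {
        "rot_k": rng.choice([1, 2, 3]) if rng.random() < p else 0,
        "hflip": rng.random() < p,
        "vflip": rng.random() < p,
    }


def apply_params(patch, params):
    """Apply the sampled rotation and flips to a single patch."""
    out = rot90(patch, params["rot_k"])
    if params["hflip"]:
        out = [row[::-1] for row in out]
    if params["vflip"]:
        out = out[::-1]
    return out


def augment_wsi(patches, rng, p=0.5):
    """Augment all the patches of one WSI with the same parameters."""
    params = sample_params(rng, p)
    return [apply_params(patch, params) for patch in patches], params


rng = random.Random(0)
wsi_patches = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]  # two toy 2x2 patches
augmented, used_params = augment_wsi(wsi_patches, rng)
```

Sampling per WSI rather than per patch avoids mixing, within the same bag, patches that were rotated or flipped differently.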

3.5 Metric to evaluate the performance

The performance of the models is evaluated in terms of WSI classification, using the weighted F1-score. The classification problem can be defined as a binary problem (celiac disease), multiclass problem (lung cancer) or multilabel problem (colon cancer). The F1-score is a metric to measure the accuracy of a classifier, combining recall and precision. Precision evaluates how well a classifier avoids predicting negative samples as positive ones, while recall evaluates how well it retrieves all the positive samples. In all the use cases, data may show unbalanced class distribution, since they are randomly selected from workflows, aiming to simulate a real-case scenario. For this reason, the weighted F1-score is adopted. The weighted F1-score tackles class imbalance by evaluating the F1-scores for the single classes and then averaging them according to the class support (number of true samples for the class). The weighted F1-score is reported in terms of average and standard deviation of the ten experiment repetitions, evaluated on the test partition.
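As a worked illustration of the metric, the following minimal Python sketch computes the weighted F1-score for single-label predictions (binary or multiclass). The function names and the toy labels are illustrative; in the multilabel case the same per-class F1 and support weighting would be applied to each label column.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1-score of one class, treating it as the positive class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def weighted_f1(y_true, y_pred):
    """Average of the per-class F1-scores, weighted by class support."""
    total = len(y_true)
    score = 0.0
    for cls in sorted(set(y_true)):
        support = y_true.count(cls)  # number of true samples of the class
        score += f1_for_class(y_true, y_pred, cls) * support / total
    return score


# Toy example: three samples of class 0 and one sample of class 1.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
score = weighted_f1(y_true, y_pred)
```

In the toy example the per-class F1-scores are 0.8 (class 0, support 3) and 2/3 (class 1, support 1), so the weighted score is (3 · 0.8 + 1 · 2/3) / 4 ≈ 0.767. The `f1_score` function of scikit-learn with `average="weighted"` computes the same quantity.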

3.6 Statistical significance test

The performance difference among different setups is evaluated through the Wilcoxon Rank-Sum test (Woolson, 2007). The test aims to establish if the results of two different experiments are statistically significantly different ($p$-value < 0.05).
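A minimal sketch of the rank-sum test is given below, using the normal approximation without tie or continuity correction (the exact variant adopted in the paper is not specified, so this choice is an assumption). The F1-score values in the example are made-up numbers for illustration, not results from the paper.

```python
import math


def rank_sum_test(a, b):
    """Wilcoxon rank-sum test with normal approximation.

    Returns the z statistic and the two-sided p-value.
    Tied values receive the average of the ranks they span.
    """
    pooled = sorted((value, group) for group, sample in enumerate((a, b))
                    for value in sample)
    rank_sum_a = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of the ranks i+1, ..., j
        rank_sum_a += average_rank * sum(1 for k in range(i, j)
                                         if pooled[k][1] == 0)
        i = j

    n1, n2 = len(a), len(b)
    expected = n1 * (n1 + n2 + 1) / 2
    std = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum_a - expected) / std
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical F1-scores of two setups (illustrative values only).
setup_a = [0.91, 0.92, 0.93, 0.94, 0.95]
setup_b = [0.81, 0.82, 0.83, 0.84, 0.85]
z, p_value = rank_sum_test(setup_a, setup_b)
significant = p_value < 0.05
```

With ten repetitions per setup, as in the paper, the two samples passed to the test would contain ten weighted F1-scores each.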

3.7 K-fold cross-validation

All the setups presented in the paper are trained using $k$-fold cross-validation, in order to evaluate the robustness of the model on data used for training. The training partition is divided into $k$ folds ($k$ = 10 in this paper). During every training repetition, $k-1$ folds are used to train the model, while the remaining fold is used to validate it. Data are split into partitions at the patient level, so that WSIs collected from a patient cannot be in two different partitions.
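The patient-level split can be sketched as follows. The round-robin assignment of shuffled patients to folds, the function names and the toy identifiers are assumptions for illustration; the paper only states that WSIs of the same patient never fall in different partitions.

```python
import random
from collections import defaultdict


def patient_level_folds(wsi_to_patient, k=10, seed=0):
    """Split WSIs into k folds so that each patient belongs to one fold."""
    patients = sorted(set(wsi_to_patient.values()))
    random.Random(seed).shuffle(patients)
    fold_of_patient = {patient: i % k for i, patient in enumerate(patients)}

    folds = defaultdict(list)
    for wsi, patient in sorted(wsi_to_patient.items()):
        folds[fold_of_patient[patient]].append(wsi)
    return [folds[i] for i in range(k)]


def train_val_splits(folds):
    """Yield (training WSIs, validation WSIs) for each of the k repetitions."""
    for i, validation in enumerate(folds):
        training = [wsi for j, fold in enumerate(folds) if j != i
                    for wsi in fold]
        yield training, validation


# Toy dataset: 40 WSIs from 20 patients (two WSIs per patient).
toy = {f"wsi_{i}": f"patient_{i // 2}" for i in range(40)}
folds = patient_level_folds(toy, k=10)
splits = list(train_val_splits(folds))
```

Assigning patients rather than WSIs to folds prevents slides of the same patient from appearing in both the training and the validation fold of a repetition.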

3.8 Hardware and Software

The experiments are developed using Python libraries. The deep learning algorithms are implemented and trained using PyTorch 2.2.0 and run on a Tesla V100 GPU. WSIs are accessed using openslide 3.4.1 (Goode et al., 2013). WSI pre-processing involves the Multi_Scale_Tools library (Marini et al., 2021b) and data augmentation is applied using albumentations 1.3.1 (Buslaev et al., 2020). The performance of the model is quantitatively evaluated using the metrics implemented by scikit-learn 0.22.

3.9 Hyperparameters

The optimal configuration of both CNN and ViT hyperparameters is identified using the grid search algorithm. The optimal set is the one reaching the lowest loss on the classification of WSIs, considering the validation partition. The parameters tested with the grid search algorithm are: the batch size (4 selected; 1, 2, 4, 8 tested); the CNN optimizer (Adam selected); the ViT optimizer (Adam selected; Adam, LARS and AdamW tested); the number of epochs when the CNN model is trained (15; beyond this number of epochs, the loss function evaluated on the validation partition no longer decreases); the number of epochs when the HIPT model is trained (15; beyond this number of epochs, the loss function evaluated on the validation partition no longer decreases); the learning rate ($10^{-4}$ selected; $10^{-2}$, $10^{-3}$, $10^{-4}$, $10^{-5}$ tested); the decay rate ($10^{-4}$ selected; $10^{-2}$, $10^{-3}$, $10^{-4}$, $10^{-5}$ tested); the number of nodes in the intermediate layer after the ResNet and the ViT backbone (128 selected; 64, 128, 256, 512 tested).
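The grid search over the reported values can be sketched as follows. The search space mirrors the values listed above (with the ViT optimizer options), while `toy_validation_loss` is a hypothetical placeholder for training a model and measuring its loss on the validation partition; it is built only so that the example runs and returns the configuration reported as selected.

```python
from itertools import product

SEARCH_SPACE = {
    "batch_size": [1, 2, 4, 8],
    "optimizer": ["Adam", "LARS", "AdamW"],
    "learning_rate": [1e-2, 1e-3, 1e-4, 1e-5],
    "decay_rate": [1e-2, 1e-3, 1e-4, 1e-5],
    "hidden_nodes": [64, 128, 256, 512],
}

# Configuration reported as selected in the paper.
SELECTED = {
    "batch_size": 4,
    "optimizer": "Adam",
    "learning_rate": 1e-4,
    "decay_rate": 1e-4,
    "hidden_nodes": 128,
}


def toy_validation_loss(config):
    """Placeholder loss: number of values differing from SELECTED.

    In a real run this would train the model with `config` and return
    the loss measured on the validation partition.
    """
    return sum(config[name] != SELECTED[name] for name in config)


def grid_search(space, validation_loss):
    """Evaluate every combination and keep the one with the lowest loss."""
    names = list(space)
    best_config, best_loss = None, float("inf")
    for values in product(*(space[name] for name in names)):
        config = dict(zip(names, values))
        loss = validation_loss(config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss


best_config, best_loss = grid_search(SEARCH_SPACE, toy_validation_loss)
```

The full grid over these five hyperparameters contains 4 · 3 · 4 · 4 · 4 = 768 combinations, which illustrates why the number of tested values per hyperparameter is kept small.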

4 Results

4.1 Automatic labels

Table 4: Overview of the performance reached by SKET on the extraction of meaningful concepts from pathology reports, evaluated in terms of weighted F1-score. The algorithm is evaluated considering the training partitions of three use cases (celiac disease, lung cancer, colon cancer), since SKET does not require any training. The performance is evaluated considering data from Catania, RUMC and their combination.

Use case	Catania	RUMC	Cumulative
Celiac Disease	0.860	0.964	0.944
Lung Cancer	0.969	0.975	0.976
Colon Cancer	0.976	0.961	0.971

Meaningful concepts can be extracted from pathology reports without the need for human intervention and can be adopted as weak labels, dramatically reducing the time needed to collect labels.

The performance of SKET (the tool used to extract weak labels from reports) is evaluated on the training partitions of the three use cases, since SKET, being a rule-based algorithm, does not require any training. The extracted concepts are compared with the manual labels provided by medical experts. Table 4 summarizes the results. On every use case, SKET reaches a weighted F1-score of at least 0.944 on the cumulative partition. On the single pathology workflows, the lowest performance is reached on the Catania partition of the celiac disease data (0.860). Otherwise, the algorithm reaches high performance, always over 0.960 in terms of F1-score.
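For reference, the weighted F1-score used to compare extracted and manual labels can be sketched as below. This is a plain-Python illustration for the single-label case only (the multilabel colon cancer case would need a per-label variant), and the toy labels are made up rather than taken from the datasets. The `f1_score` function of scikit-learn with `average="weighted"` computes the same support-weighted quantity.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of the per-class F1-scores."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        n_pred = sum(p == label for p in y_pred)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        # Each class contributes proportionally to its number of true samples.
        total += f1 * n_true
    return total / len(y_true)


# Made-up toy labels, not data from the paper.
manual = ["celiac", "celiac", "normal", "normal", "normal", "celiac"]
automatic = ["celiac", "normal", "normal", "normal", "normal", "celiac"]
score = weighted_f1(manual, automatic)
print(round(score, 3))
```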

Being effective, SKET can be adopted to mine unlabeled datasets and to annotate large amounts of data that can be used to train deep learning models. When tested on a Tesla V100 GPU, SKET requires between 0.006 and 0.03 seconds to extract concepts from a report, depending on its length. Even in its worst-case scenario, the algorithm is a thousand times faster than a human expert, who needs 30 s per report in the best-case scenario. Therefore, the application of SKET saves at least 99.9% of the time required by human experts. For instance, the weak labeling of 10’000 WSIs would require 300’000 seconds (around 83 hours, without breaks) for human experts in the best-case scenario, whereas SKET would require 300 seconds (five minutes) in the worst-case scenario.
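The arithmetic behind this comparison can be reproduced with a few lines, using only the figures quoted above (30 s per report for a human expert in the best case, 0.03 s per report for SKET in the worst case, 10’000 reports). The variable names are illustrative.

```python
N_REPORTS = 10_000
HUMAN_SECONDS_PER_REPORT = 30.0  # best case for a human expert
SKET_SECONDS_PER_REPORT = 0.03   # worst case for SKET

human_total = N_REPORTS * HUMAN_SECONDS_PER_REPORT
sket_total = N_REPORTS * SKET_SECONDS_PER_REPORT

# Ratio of the two totals and fraction of the human time that is saved.
speedup = human_total / sket_total
time_saved = 1.0 - sket_total / human_total

print(f"human: {human_total:.0f} s ({human_total / 3600:.1f} h)")
print(f"SKET: {sket_total:.0f} s ({sket_total / 60:.1f} min)")
print(f"speed-up: {speedup:.0f}x, time saved: {time_saved:.1%}")
```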

4.2 Celiac disease

Multiple computer vision architectures trained with automatically annotated binary data classify celiac disease WSIs as effectively as models trained with manually annotated data.

Tables 5 and 6 summarize the results. The highest performance using manual labels is reached by the ViT architecture (F1-score = 0.914 $\pm$ 0.014 on the cumulative test partition), although transMIL shows the highest performance on the Catania partition. The results are nevertheless similar across the three architectures.

Table 5 shows the classification performance obtained using binary manual labels and noisy labels. This experiment aims to identify general rules for the adoption of automatic labels in the binary classification of WSIs. For all the architectures, the performance is similar to the one obtained using manual labels as long as no more than 10% of the training samples are wrongly annotated: on the cumulative partition, the difference in performance is not statistically significant (the only significant differences are those of transMIL on the Catania partition with 5% and 10% of noisy labels). When the percentage of wrongly annotated training samples is 20% or more, the performance degrades and the difference, compared with manual labels, is statistically significant. This suggests that 10% of wrongly annotated labels can be considered as a threshold for the adoption of automatic weak labels in a binary classification scenario.
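A possible way to generate noisy labels at a given percentage is sketched below. The exact perturbation procedure is not detailed in this section, so the sketch assumes that a fixed fraction of samples is drawn uniformly at random and that each selected label is replaced by a different class. The function name and the toy labels are illustrative.

```python
import random


def perturb_labels(labels, classes, noise_fraction, seed=0):
    """Return a copy of `labels` in which a fraction of the entries,
    chosen uniformly at random, is replaced by a different class."""
    rng = random.Random(seed)
    n_noisy = round(noise_fraction * len(labels))
    noisy_indices = rng.sample(range(len(labels)), n_noisy)
    noisy = list(labels)
    for i in noisy_indices:
        alternatives = [c for c in classes if c != labels[i]]
        noisy[i] = rng.choice(alternatives)
    return noisy


# Made-up binary labels (1 = celiac disease, 0 = normal), not data from the paper.
clean = [0, 1] * 500
for fraction in (0.01, 0.02, 0.05, 0.10, 0.20, 0.50):
    noisy = perturb_labels(clean, classes=(0, 1), noise_fraction=fraction)
    changed = sum(a != b for a, b in zip(clean, noisy))
    print(f"{fraction:.0%} noise -> {changed} of {len(clean)} labels changed")
```

With binary labels every selected sample is flipped to the opposite class, whereas with more than two classes the replacement is drawn among the remaining ones, which is one reason why the same noise percentage can affect the two settings differently.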

Table 5: Results on the classification of celiac disease, in terms of F1-score. The performance is evaluated considering three computer vision architectures: CLAM, transMIL, ViT. The architectures are trained with manual weak binary labels and with noisy weak labels, randomly perturbed according to different percentages of noise. The percentage of noisy labels is reported in the ’Noisy Labels’ column, while the accuracy of the labels, in terms of F1-score, is reported in the ’F1 Labels’ column. The goal is to evaluate the effect that noisy weak labels have on the binary classification of WSIs. For every setup, the average and standard deviation of the F1-score are reported, considering the models trained with 10-fold cross-validation. The setups where the difference in performance is statistically significant (compared with the models trained with manual labels) are marked with an asterisk (*).

Noisy Labels	F1 Labels	Model	Catania	RUMC	Cumulative
Manual	-	CLAM_SB	0.958 $\pm$ 0.009	0.846 $\pm$ 0.023	0.900 $\pm$ 0.012
		transMIL	0.968 $\pm$ 0.009	0.850 $\pm$ 0.019	0.906 $\pm$ 0.010
		ViT	0.953 $\pm$ 0.011	0.877 $\pm$ 0.021	0.914 $\pm$ 0.014
1%	0.977	CLAM_SB	0.954 $\pm$ 0.016	0.849 $\pm$ 0.024	0.900 $\pm$ 0.018
		transMIL	0.968 $\pm$ 0.009	0.864 $\pm$ 0.010	0.914 $\pm$ 0.007
		ViT	0.954 $\pm$ 0.014	0.896 $\pm$ 0.019	0.925 $\pm$ 0.010
2%	0.968	CLAM_SB	0.951 $\pm$ 0.012	0.873 $\pm$ 0.021	0.911 $\pm$ 0.014
		transMIL	0.965 $\pm$ 0.011	0.853 $\pm$ 0.021	0.907 $\pm$ 0.010
		ViT	0.944 $\pm$ 0.017	0.877 $\pm$ 0.021	0.910 $\pm$ 0.013
5%	0.933	CLAM_SB	0.951 $\pm$ 0.019	0.862 $\pm$ 0.019	0.905 $\pm$ 0.017
		transMIL	0.958 $\pm$ 0.012*	0.857 $\pm$ 0.018	0.905 $\pm$ 0.011
		ViT	0.938 $\pm$ 0.026	0.880 $\pm$ 0.026	0.910 $\pm$ 0.020
10%	0.909	CLAM_SB	0.952 $\pm$ 0.013	0.862 $\pm$ 0.023	0.905 $\pm$ 0.017
		transMIL	0.953 $\pm$ 0.026*	0.838 $\pm$ 0.033	0.893 $\pm$ 0.027
		ViT	0.957 $\pm$ 0.014	0.860 $\pm$ 0.023	0.906 $\pm$ 0.014
20%	0.804	CLAM_SB	0.922 $\pm$ 0.026*	0.819 $\pm$ 0.029	0.869 $\pm$ 0.023*
		transMIL	0.933 $\pm$ 0.024*	0.822 $\pm$ 0.013*	0.875 $\pm$ 0.016*
		ViT	0.925 $\pm$ 0.017*	0.834 $\pm$ 0.025*	0.879 $\pm$ 0.017*
50%	0.566	CLAM_SB	0.537 $\pm$ 0.228*	0.450 $\pm$ 0.081*	0.490 $\pm$ 0.145*
		transMIL	0.765 $\pm$ 0.097*	0.502 $\pm$ 0.020*	0.633 $\pm$ 0.041*
		ViT	0.440 $\pm$ 0.302*	0.459 $\pm$ 0.029*	0.480 $\pm$ 0.141*

Table 6 shows the comparison of automatic labels, generated with SKET, and manual labels. The comparison between automatic and manual labels shows an F1-score equal to 0.944, suggesting that the algorithm should lead to performance similar to the one obtained with noisy labels when the percentage of mislabeled data is between 2% and 5%. The results confirm the hypothesis: the performance is slightly worse than the one obtained using manual labels, but the gap is not statistically significant (according to the Wilcoxon Rank-Sum test, comparing every setup to the one where manual labels are used), showing the effectiveness of automatic labels in a binary classification scenario.
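The significance check mentioned above can be illustrated with a small standard-library sketch of the two-sided Wilcoxon rank-sum test. It uses the normal approximation without tie correction, which is an assumption of this sketch and not necessarily the exact variant used in the experiments. The per-fold scores are made up for illustration. A library implementation is available as `scipy.stats.ranksums`.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation, no tie correction)."""
    n_x, n_y = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:n_x])
    expected = n_x * (n_x + n_y + 1) / 2
    std = math.sqrt(n_x * n_y * (n_x + n_y + 1) / 12)
    z = (rank_sum_x - expected) / std
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up F1-scores of 10 cross-validation models, not values from the paper.
manual_f1 = [0.91, 0.90, 0.92, 0.93, 0.89, 0.91, 0.92, 0.90, 0.94, 0.92]
noisy_f1 = [0.48, 0.52, 0.30, 0.61, 0.45, 0.55, 0.39, 0.66, 0.50, 0.44]

z, p_value = rank_sum_test(noisy_f1, manual_f1)
print(f"z = {z:.2f}, p = {p_value:.5f}, significant at 0.05: {p_value < 0.05}")
```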

Table 6: Results on the classification of celiac disease, in terms of F1-score. The performance is evaluated considering three computer vision architectures: CLAM, transMIL, ViT. The architectures are trained with manual weak binary labels and with automatic ones, generated by extracting meaningful concepts from the corresponding pathology reports using the SKET algorithm. The performance of SKET is reported in the ’F1 Labels’ column. The goal is to evaluate the effectiveness of automatic labels on the binary classification of WSIs. For every setup, the average and standard deviation of the F1-score are reported, considering the models trained with 10-fold cross-validation. The setups where the difference in performance is statistically significant (compared with the models trained with manual labels) are marked with an asterisk (*).

Labels	F1 Labels	Model	Catania	RUMC	Cumulative
Automatic	0.944	CLAM_SB	0.948 $\pm$ 0.015	0.857 $\pm$ 0.017	0.901 $\pm$ 0.013
		transMIL	0.960 $\pm$ 0.012	0.845 $\pm$ 0.017	0.900 $\pm$ 0.014
		ViT	0.938 $\pm$ 0.023	0.889 $\pm$ 0.024	0.915 $\pm$ 0.015
Manual	-	CLAM_SB	0.958 $\pm$ 0.009	0.846 $\pm$ 0.023	0.900 $\pm$ 0.012
		transMIL	0.968 $\pm$ 0.009	0.850 $\pm$ 0.019	0.906 $\pm$ 0.010
		ViT	0.953 $\pm$ 0.011	0.877 $\pm$ 0.021	0.914 $\pm$ 0.014

4.3 Lung cancer

Multiple computer vision architectures trained with automatically annotated multiclass data classify lung cancer WSIs as effectively as models trained with manually annotated data.

Tables 7 and 8 summarize the results. The highest performance using manual labels is reached by the ViT architecture (cumulative F1-score = 0.763 $\pm$ 0.012), which is also the best on both test partitions and clearly outperforms the other two architectures (CLAM reaches 0.674 $\pm$ 0.016, while transMIL reaches 0.696 $\pm$ 0.016).

Table 7 shows the classification performance obtained using multiclass manual labels and noisy labels. This experiment aims to identify general rules for the adoption of automatic labels in the multiclass classification of WSIs. For all the architectures, the performance is similar to the one obtained using manual labels as long as no more than 20% of the training samples are wrongly annotated: the difference in performance is not statistically significant. When the percentage of wrongly annotated training samples is 50%, the performance degrades and the difference, compared with manual labels, is statistically significant. This suggests that 20% of wrongly annotated labels can be considered as a threshold for the adoption of automatic weak labels in a multiclass classification scenario.

Table 7: Results on the classification of lung cancer, in terms of F1-score. The performance is evaluated considering three computer vision architectures: CLAM, transMIL, ViT. The architectures are trained with manual weak multiclass labels and with noisy weak labels, randomly perturbed according to different percentages of noise. The percentage of noisy labels is reported in the ’Noisy Labels’ column, while the accuracy of the labels, in terms of F1-score, is reported in the ’F1 Labels’ column. The goal is to evaluate the effect that noisy weak labels have on the multiclass classification of WSIs. For every setup, the average and standard deviation of the F1-score are reported, considering the models trained with 10-fold cross-validation. The setups where the difference in performance is statistically significant (compared with the models trained with manual labels) are marked with an asterisk (*).

Noisy Labels	F1 Labels	Model	Catania	RUMC	Cumulative
Manual	-	CLAM_MB	0.617 $\pm$ 0.027	0.717 $\pm$ 0.023	0.674 $\pm$ 0.016
		transMIL	0.635 $\pm$ 0.024	0.745 $\pm$ 0.024	0.696 $\pm$ 0.016
		ViT	0.705 $\pm$ 0.033	0.812 $\pm$ 0.020	0.763 $\pm$ 0.012
1%	0.991	CLAM_MB	0.624 $\pm$ 0.022	0.725 $\pm$ 0.021	0.681 $\pm$ 0.014
		transMIL	0.634 $\pm$ 0.042	0.756 $\pm$ 0.012	0.700 $\pm$ 0.020
		ViT	0.697 $\pm$ 0.035	0.817 $\pm$ 0.018	0.762 $\pm$ 0.021
2%	0.980	CLAM_MB	0.621 $\pm$ 0.034	0.721 $\pm$ 0.016	0.677 $\pm$ 0.018
		transMIL	0.642 $\pm$ 0.033	0.739 $\pm$ 0.011	0.695 $\pm$ 0.017
		ViT	0.698 $\pm$ 0.032	0.807 $\pm$ 0.026	0.757 $\pm$ 0.026
5%	0.957	CLAM_MB	0.609 $\pm$ 0.035	0.715 $\pm$ 0.022	0.670 $\pm$ 0.021
		transMIL	0.622 $\pm$ 0.050	0.743 $\pm$ 0.015	0.687 $\pm$ 0.026
		ViT	0.699 $\pm$ 0.027	0.809 $\pm$ 0.029	0.758 $\pm$ 0.020
10%	0.907	CLAM_MB	0.601 $\pm$ 0.037	0.690 $\pm$ 0.034	0.653 $\pm$ 0.027
		transMIL	0.615 $\pm$ 0.029	0.739 $\pm$ 0.025	0.683 $\pm$ 0.023
		ViT	0.699 $\pm$ 0.026	0.808 $\pm$ 0.018	0.757 $\pm$ 0.015
20%	0.822	CLAM_MB	0.579 $\pm$ 0.060	0.725 $\pm$ 0.038	0.658 $\pm$ 0.042
		transMIL	0.614 $\pm$ 0.039	0.743 $\pm$ 0.017	0.684 $\pm$ 0.018
		ViT	0.702 $\pm$ 0.018	0.808 $\pm$ 0.015	0.759 $\pm$ 0.012
50%	0.561	CLAM_MB	0.409 $\pm$ 0.087*	0.528 $\pm$ 0.069*	0.477 $\pm$ 0.065*
		transMIL	0.483 $\pm$ 0.055*	0.566 $\pm$ 0.027*	0.537 $\pm$ 0.031*
		ViT	0.576 $\pm$ 0.049*	0.701 $\pm$ 0.040*	0.643 $\pm$ 0.038*

Table 8 includes the comparison of automatic labels and manual labels. This comparison represents a real-case scenario of automatic data labeling, where automatic labels are generated by extracting concepts from reports. The comparison between the two sets of labels shows an F1-score equal to 0.976, suggesting that the algorithm should lead to performance similar to the one obtained in the previous experiment with between 2% and 5% of noisy labels. The results confirm the hypothesis: the performance is slightly worse than the one obtained using manual labels, but the gap is not statistically significant (according to the Wilcoxon Rank-Sum test, comparing every setup to the one where manual labels are used).

Table 8: Results on the classification of lung cancer, in terms of F1-score. The performance is evaluated considering three computer vision architectures: CLAM, transMIL, ViT. The architectures are trained with manual weak multiclass labels and with automatic ones, generated by extracting meaningful concepts from the corresponding pathology reports using the SKET algorithm. The performance of SKET is reported in the ’F1 Labels’ column. The goal is to evaluate the effectiveness of automatic labels on the multiclass classification of WSIs. For every setup, the average and standard deviation of the F1-score are reported, considering the models trained with 10-fold cross-validation. The setups where the difference in performance is statistically significant (compared with the models trained with manual labels) are marked with an asterisk (*).

Labels	F1 Labels	Model	Catania	RUMC	Cumulative
Automatic	0.976	CLAM_MB	0.623 $\pm$ 0.031	0.705 $\pm$ 0.028	0.670 $\pm$ 0.020
		transMIL	0.620 $\pm$ 0.027	0.740 $\pm$ 0.027	0.686 $\pm$ 0.018
		ViT	0.682 $\pm$ 0.041	0.820 $\pm$ 0.014	0.756 $\pm$ 0.022
Manual	-	CLAM_MB	0.617 $\pm$ 0.027	0.717 $\pm$ 0.023	0.674 $\pm$ 0.016
		transMIL	0.635 $\pm$ 0.024	0.745 $\pm$ 0.024	0.696 $\pm$ 0.016
		ViT	0.705 $\pm$ 0.033	0.812 $\pm$ 0.020	0.763 $\pm$ 0.012

4.4 Colon cancer

Multiple computer vision architectures trained with automatically annotated multilabel data classify colon cancer WSIs as effectively as models trained with manually annotated data.

Tables 9 and 10 summarize the results. The highest performance using manual labels is reached by the ViT architecture (cumulative F1-score = 0.831 $\pm$ 0.009), which is also the best on both test partitions and clearly outperforms the other two architectures (CLAM reaches 0.773 $\pm$ 0.015, while transMIL reaches 0.791 $\pm$ 0.008).

Table 9 shows the classification performance obtained using multilabel manual labels and noisy labels. This experiment aims to identify general rules for the adoption of automatic labels in the multilabel classification of WSIs. For all the architectures, the performance is similar to the one obtained using manual labels as long as no more than 20% of the training samples are wrongly annotated: the difference in performance is not statistically significant. When the percentage of wrongly annotated training samples is 50%, the performance degrades and the difference, compared with manual labels, is statistically significant. This suggests that 20% of wrongly annotated labels can be considered as a threshold for the adoption of automatic weak labels in a multilabel classification scenario.

Table 9: Results on the classification of colon cancer, in terms of F1-score. The performance is evaluated considering three computer vision architectures: CLAM, transMIL, ViT. The architectures are trained with manual weak multilabel labels and with noisy weak labels, randomly perturbed according to different percentages of noise. The percentage of noisy labels is reported in the ’Noisy Labels’ column, while the accuracy of the labels, in terms of F1-score, is reported in the ’F1 Labels’ column. The goal is to evaluate the effect that noisy weak labels have on the multilabel classification of WSIs. For every setup, the average and standard deviation of the F1-score are reported, considering the models trained with 10-fold cross-validation. The setups where the difference in performance is statistically significant (compared with the models trained with manual labels) are marked with an asterisk (*).

Noisy Labels	F1 Labels	Model	Catania	RUMC	Cumulative
Manual	-	CLAM_MB	0.761 $\pm$ 0.015	0.780 $\pm$ 0.017	0.773 $\pm$ 0.015
		transMIL	0.771 $\pm$ 0.015	0.807 $\pm$ 0.007	0.791 $\pm$ 0.008
		ViT	0.824 $\pm$ 0.016	0.837 $\pm$ 0.007	0.831 $\pm$ 0.009
1%	0.988	CLAM_MB	0.761 $\pm$ 0.018	0.776 $\pm$ 0.016	0.771 $\pm$ 0.015
		transMIL	0.772 $\pm$ 0.014	0.810 $\pm$ 0.009	0.793 $\pm$ 0.010
		ViT	0.827 $\pm$ 0.018	0.835 $\pm$ 0.005	0.831 $\pm$ 0.009
2%	0.978	CLAM_MB	0.745 $\pm$ 0.018	0.764 $\pm$ 0.019	0.757 $\pm$ 0.017
		transMIL	0.777 $\pm$ 0.019	0.807 $\pm$ 0.010	0.793 $\pm$ 0.012
		ViT	0.821 $\pm$ 0.019	0.837 $\pm$ 0.005	0.831 $\pm$ 0.009
5%	0.943	CLAM_MB	0.765 $\pm$ 0.018	0.771 $\pm$ 0.021	0.769 $\pm$ 0.018
		transMIL	0.766 $\pm$ 0.013	0.808 $\pm$ 0.009	0.790 $\pm$ 0.008
		ViT	0.819 $\pm$ 0.015	0.835 $\pm$ 0.008	0.828 $\pm$ 0.009
10%	0.898	CLAM_MB	0.767 $\pm$ 0.023	0.777 $\pm$ 0.019	0.774 $\pm$ 0.018
		transMIL	0.768 $\pm$ 0.017	0.805 $\pm$ 0.009	0.789 $\pm$ 0.010
		ViT	0.827 $\pm$ 0.015	0.836 $\pm$ 0.005	0.833 $\pm$ 0.008
20%	0.814	CLAM_MB	0.748 $\pm$ 0.026	0.757 $\pm$ 0.020	0.754 $\pm$ 0.019
		transMIL	0.772 $\pm$ 0.012	0.809 $\pm$ 0.010	0.793 $\pm$ 0.008
		ViT	0.822 $\pm$ 0.020	0.833 $\pm$ 0.003	0.829 $\pm$ 0.009
50%	0.587	CLAM_MB	0.697 $\pm$ 0.042*	0.646 $\pm$ 0.086*	0.670 $\pm$ 0.056*
		transMIL	0.723 $\pm$ 0.027*	0.720 $\pm$ 0.024*	0.721 $\pm$ 0.015*
		ViT	0.811 $\pm$ 0.016*	0.804 $\pm$ 0.021*	0.807 $\pm$ 0.016*

Table 10 includes the comparison of automatic labels and manual labels. This comparison represents a real-case scenario of automatic data labeling, where automatic labels are generated by extracting concepts from reports. The comparison between the two sets of labels shows an F1-score equal to 0.971, suggesting that the algorithm should lead to performance similar to the one obtained in the previous experiment with between 2% and 5% of noisy labels. The results confirm the hypothesis: the performance is slightly worse than the one obtained using manual labels, but the gap is not statistically significant (according to the Wilcoxon Rank-Sum test, comparing every setup to the one where manual labels are used).

Table 10: Results on the classification of colon cancer, in terms of F1-score. The performance is evaluated considering three computer vision architectures: CLAM, transMIL, ViT. The architectures are trained with manual weak multilabel labels and with automatic ones, generated by extracting meaningful concepts from the corresponding pathology reports using the SKET algorithm. The performance of SKET is reported in the ’F1 Labels’ column. The goal is to evaluate the effectiveness of automatic labels on the multilabel classification of WSIs. For every setup, the average and standard deviation of the F1-score are reported, considering the models trained with 10-fold cross-validation. The setups where the difference in performance is statistically significant (compared with the models trained with manual labels) are marked with an asterisk (*).

Labels	F1 Labels	Model	Catania	RUMC	Cumulative
Automatic	0.971	CLAM_MB	0.761 $\pm$ 0.014	0.771 $\pm$ 0.019	0.767 $\pm$ 0.016
		transMIL	0.759 $\pm$ 0.013	0.801 $\pm$ 0.004	0.783 $\pm$ 0.005
		ViT	0.813 $\pm$ 0.014	0.836 $\pm$ 0.008	0.826 $\pm$ 0.008
Manual	-	CLAM_MB	0.761 $\pm$ 0.015	0.780 $\pm$ 0.017	0.773 $\pm$ 0.015
		transMIL	0.771 $\pm$ 0.015	0.807 $\pm$ 0.007	0.791 $\pm$ 0.008
		ViT	0.824 $\pm$ 0.016	0.837 $\pm$ 0.007	0.831 $\pm$ 0.009

5 Discussion

This paper evaluates the application of weak automatic labels to train computer vision algorithms for the classification of WSIs.

The application of automatic weak labels would dramatically reduce the time needed to collect samples to train algorithms for the analysis of biomedical data. However, it is not clear under which conditions automatic labels can be adopted to train algorithms.

The results achieved in the paper show that automatic labels are as effective as manual ones for the classification of WSIs. The first experiments (where manual labels are compared to different percentages of noisy labels) allow the identification of some patterns in the algorithm performance. The noise introduced by mislabeled samples (inherently present within automatic labels) impacts the performance of the networks, in terms of accuracy and robustness. The performance achieved using small percentages of noisy labels is still comparable to the one achieved using manual labels, up to a given percentage of mislabeled data: 10% for celiac disease (binary labels) and 20% for lung and colon cancer (multiclass and multilabel labels, respectively).

This difference between the use cases can be explained by considering the different nature of the labels. Mislabeled samples have a high impact on binary classification, since label flipping leads to the opposite result. Annotation errors are also disruptive for multiclass labels, even if in this case the effect can be smoothed when the errors involve similar classes (already prone to uncertainty). Another explanation for this gap can be found in the size of the training dataset, since the effect of mislabeled samples on the training may be compensated by the other samples. In this paper, the celiac disease training dataset includes around 1’000 samples, while the lung cancer dataset includes around 2’500 samples and the colon cancer one around 6’500. In the celiac disease use case, when the percentage of mislabeled samples is 20% or more, the performance of the architectures is no longer comparable with the one reached using manual labels, whereas in the lung and colon cancer use cases it is still comparable when the percentage of mislabeled samples is 20%. This result suggests that automatic labels can be adopted when the algorithm used to generate them is accurate.

The effect of noisy labels can also be observed in the standard deviation of the performance: the higher the percentage of noisy labels, the less robust the three architectures are.

The architectures trained using automatic labels reach performance comparable (i.e. the performance difference is not statistically significant) with the one reached using manual labels. The results obtained using SKET to generate automatic weak labels show that such labels can be used to train different architectures for the classification of WSIs. The conditions identified using randomly perturbed noisy data are also tested in a real-case scenario, where the automatic labels are generated using SKET, an NLP algorithm that extracts meaningful concepts from pathology reports. This set of experiments is necessary to show the application of automatic labels in a real-case scenario, where the likelihood of mislabeling a sample varies from sample to sample. For example, if weak labels are automatically extracted from a report, a sample is more or less likely to be mislabeled depending on the report content. This characteristic does not apply to the randomly perturbed noisy samples, where every sample has the same chance of being mislabeled.

The fact that automatic labels are as effective as manual labels opens many perspectives for the computational pathology domain and for the biomedical domain in general. Automatic labels limit the need for medical experts to annotate data, saving about 99.9% of the time otherwise needed to analyze reports in order to infer labels: a dataset including around 10’000 WSIs can be weakly annotated in around five minutes. Considering that every year a large amount of biomedical data is produced and only a small percentage is annotated, this would make it possible to exploit a vast amount of data to build more accurate and robust models, helping medical experts to diagnose diseases more effectively.

6 Conclusions

The application of automatic labels may help to exploit vast amounts of unlabeled biomedical samples to train more robust models, reducing by about 99.9% the time needed to collect weakly annotated samples. However, it is still not clear when this kind of label is effective. This paper evaluates the performance obtained with different percentages of noisy labels (1, 2, 5, 10, 20 and 50%) and compares the results with the performance obtained by the same architectures using manual weak labels provided by medical experts. After the identification of some rules (e.g. training datasets with up to 10% of mislabeled samples lead to performance comparable to the one obtained using manual labels), SKET, an algorithm that extracts meaningful concepts from reports, is used to generate automatic weak labels. The performance reached by the models trained with SKET labels is comparable (the difference is not statistically significant) to the one obtained with manual labels, showing the effectiveness of automatic labels. This result can make it possible to annotate the samples stored in hospitals without the need for human effort, paving the way to increasingly accurate algorithms. The code including the implementation of the computer vision algorithms to classify WSIs is publicly available on GitHub (https://github.com/ilmaro8/wsi_analysis).

Acknowledgments

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 825292 (ExaMode, http://www.examode.eu/).

References

Van der Laak et al. [2021] Jeroen Van der Laak, Geert Litjens, and Francesco Ciompi. Deep learning in histopathology: the path to the clinic. Nature medicine, 27(5):775–784, 2021.
De Matos et al. [2021] Jonathan De Matos, Steve Tsham Mpinda Ataky, Alceu de Souza Britto Jr, Luiz Eduardo Soares de Oliveira, and Alessandro Lameiras Koerich. Machine learning methods for histopathological image analysis: A review. Electronics, 10(5):562, 2021.
Gurcan et al. [2009] Metin N Gurcan, Laura E Boucheron, Ali Can, Anant Madabhushi, Nasir M Rajpoot, and Bulent Yener. Histopathological image analysis: A review. IEEE reviews in biomedical engineering, 2:147–171, 2009.
Krupinski et al. [2013] Elizabeth A Krupinski, Anna R Graham, and Ronald S Weinstein. Characterizing the development of visual search expertise in pathology residents viewing whole slide images. Human pathology, 44(3):357–364, 2013.
Fraggetta et al. [2017] Filippo Fraggetta, Salvatore Garozzo, Gian Franco Zannoni, Liron Pantanowitz, and Esther Diana Rossi. Routine digital pathology workflow: the Catania experience. Journal of pathology informatics, 8(1):51, 2017.
Fraggetta et al. [2021] Filippo Fraggetta, Vincenzo L’Imperio, David Ameisen, Rita Carvalho, Sabine Leh, Tim-Rasmus Kiehl, Mircea Serbanescu, Daniel Racoceanu, Vincenzo Della Mea, Antonio Polonia, et al. Best practice recommendations for the implementation of a digital pathology workflow in the anatomic pathology laboratory by the European Society of Digital and Integrative Pathology (ESDIP). Diagnostics, 11(11):2167, 2021.
Merchant and Castleman [2022] Fatima Merchant and Kenneth Castleman. Microscope image processing. Academic press, 2022.
Hewer [2020] Ekkehard Hewer. The oncologist’s guide to synoptic reporting: a primer. Oncology, 98(6):396–402, 2020.
Hanna et al. [2020] Matthew G Hanna, Victor E Reuter, Orly Ardon, David Kim, Sahussapont Joseph Sirintrapun, Peter J Schüffler, Klaus J Busam, Jennifer L Sauter, Edi Brogi, Lee K Tan, et al. Validation of a digital pathology system including remote review during the COVID-19 pandemic. Modern Pathology, 33(11):2115–2127, 2020.
Madabhushi and Lee [2016] Anant Madabhushi and George Lee. Image analysis and machine learning in digital pathology: Challenges and opportunities. Medical image analysis, 33:170–175, 2016.
Litjens et al. [2022] Geert Litjens, Francesco Ciompi, and Jeroen van der Laak. A decade of gigascience: The challenges of gigapixel pathology images. GigaScience, 11, 2022.
Xu et al. [2023] Hongming Xu, Qi Xu, Fengyu Cong, Jeonghyun Kang, Chu Han, Zaiyi Liu, Anant Madabhushi, and Cheng Lu. Vision transformers for computational histopathology. IEEE Reviews in Biomedical Engineering, 2023.
Cifci et al. [2023] Didem Cifci, Gregory P Veldhuizen, Sebastian Foersch, and Jakob Nikolas Kather. AI in computational pathology of cancer: improving diagnostic workflows and clinical outcomes? Annual Review of Cancer Biology, 7:57–71, 2023.
Campanella et al. [2019] Gabriele Campanella, Matthew G Hanna, Luke Geneslaw, Allen Miraflor, Vitor Werneck Krauss Silva, Klaus J Busam, Edi Brogi, Victor E Reuter, David S Klimstra, and Thomas J Fuchs. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nature medicine, 25(8):1301–1309, 2019.
Abels et al. [2019] Esther Abels, Liron Pantanowitz, Famke Aeffner, Mark D Zarella, Jeroen van der Laak, Marilyn M Bui, Venkata NP Vemuri, Anil V Parwani, Jeff Gibbs, Emmanuel Agosto-Arroyo, et al. Computational pathology definitions, best practices, and recommendations for regulatory guidance: a white paper from the digital pathology association. The Journal of pathology, 249(3):286–294, 2019.
Chen et al. [2022] Richard J Chen, Chengkuan Chen, Yicong Li, Tiffany Y Chen, Andrew D Trister, Rahul G Krishnan, and Faisal Mahmood. Scaling vision transformers to gigapixel images via hierarchical self-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16144–16155, 2022.
Karimi et al. [2020] Davood Karimi, Haoran Dou, Simon K Warfield, and Ali Gholipour. Deep learning with noisy labels: Exploring techniques and remedies in medical image analysis. Medical image analysis, 65:101759, 2020.
Deng et al. [2020] Shujian Deng, Xin Zhang, Wen Yan, Eric I Chang, Yubo Fan, Maode Lai, Yan Xu, et al. Deep learning in digital pathology image analysis: a survey. Frontiers of medicine, 14(4):470–487, 2020.
Carbonneau et al. [2018] Marc-André Carbonneau, Veronika Cheplygina, Eric Granger, and Ghyslain Gagnon. Multiple instance learning: A survey of problem characteristics and applications. Pattern Recognition, 77:329–353, 2018.
Ilse et al. [2018] Maximilian Ilse, Jakub Tomczak, and Max Welling. Attention-based deep multiple instance learning. In International conference on machine learning, pages 2127–2136. PMLR, 2018.
Wang et al. [2019] Yun Wang, Juncheng Li, and Florian Metze. A comparison of five multiple instance learning pooling functions for sound event detection with weak labeling. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 31–35. IEEE, 2019.
Lu et al. [2021] Ming Y Lu, Drew FK Williamson, Tiffany Y Chen, Richard J Chen, Matteo Barbieri, and Faisal Mahmood. Data-efficient and weakly supervised computational pathology on whole-slide images. Nature biomedical engineering, 5(6):555–570, 2021.
Hashimoto et al. [2020] Noriaki Hashimoto, Daisuke Fukushima, Ryoichi Koga, Yusuke Takagi, Kaho Ko, Kei Kohno, Masato Nakaguro, Shigeo Nakamura, Hidekata Hontani, and Ichiro Takeuchi. Multi-scale domain-adversarial multiple-instance cnn for cancer subtype classification with unannotated histopathological images. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3852–3861, 2020.
Marini et al. [2022] Niccolò Marini, Stefano Marchesin, Sebastian Otálora, Marek Wodzinski, Alessandro Caputo, Mart Van Rijthoven, Witali Aswolinskiy, John-Melle Bokhorst, Damian Podareanu, Edyta Petters, et al. Unleashing the potential of digital pathology data by training computer-aided diagnosis models without human annotations. NPJ digital medicine, 5(1):102, 2022.
Marchesin et al. [2022] Stefano Marchesin, Fabio Giachelle, Niccolò Marini, Manfredo Atzori, Svetla Boytcheva, Genziana Buttafuoco, Francesco Ciompi, Giorgio Maria Di Nunzio, Filippo Fraggetta, Ornella Irrera, et al. Empowering digital pathology applications through explainable knowledge extraction tools. Journal of pathology informatics, 13:100139, 2022.
Shao et al. [2021] Zhuchen Shao, Hao Bian, Yang Chen, Yifeng Wang, Jian Zhang, Xiangyang Ji, et al. TransMIL: Transformer based correlated multiple instance learning for whole slide image classification. Advances in neural information processing systems, 34:2136–2147, 2021.
Caio et al. [2019] Giacomo Caio, Umberto Volta, Anna Sapone, Daniel A Leffler, Roberto De Giorgio, Carlo Catassi, and Alessio Fasano. Celiac disease: a comprehensive current review. BMC medicine, 17:1–20, 2019.
Lebwohl and Rubio-Tapia [2021] Benjamin Lebwohl and Alberto Rubio-Tapia. Epidemiology, presentation, and diagnosis of celiac disease. Gastroenterology, 160(1):63–75, 2021.
Schabath and Cote [2019] Matthew B Schabath and Michele L Cote. Cancer progress and priorities: lung cancer. Cancer epidemiology, biomarkers & prevention, 28(10):1563–1579, 2019.
Organization [2023] World Health Organization. Lung cancer. Online, 2023. URL https://www.who.int/news-room/fact-sheets/detail/lung-cancer. Accessed: 2024-04-25.
Travis [2011] William D Travis. Pathology of lung cancer. Clinics in chest medicine, 32(4):669–692, 2011.
Benson et al. [2018] Al B Benson, Alan P Venook, Mahmoud M Al-Hawary, Lynette Cederquist, Yi-Jen Chen, Kristen K Ciombor, Stacey Cohen, Harry S Cooper, Dustin Deming, Paul F Engstrom, et al. NCCN guidelines insights: colon cancer, version 2.2018. Journal of the National Comprehensive Cancer Network, 16(4):359–369, 2018.
Marini et al. [2021a] Niccolò Marini, Manfredo Atzori, Sebastian Otálora, Stephane Marchand-Maillet, and Henning Müller. H&E-adversarial network: a convolutional neural network to learn stain-invariant features through hematoxylin & eosin regression. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 601–610, 2021a.
Marini et al. [2023] Niccolò Marini, Sebastian Otálora, Marek Wodzinski, Selene Tomassini, Aldo Franco Dragoni, Stephane Marchand-Maillet, Juan Pedro Dominguez Morales, Lourdes Duran-Lopez, Simona Vatrano, Henning Müller, et al. Data-driven color augmentation for H&E stained images in computational pathology. Journal of Pathology Informatics, 14:100183, 2023.
Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
Sharir et al. [2021] Gilad Sharir, Asaf Noy, and Lihi Zelnik-Manor. An image is worth 16x16 words, what is a video worth? arXiv preprint arXiv:2103.13915, 2021.
Han et al. [2020] Kai Han, Yunhe Wang, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang, An Xiao, Chunjing Xu, Yixing Xu, et al. A survey on visual transformer. arXiv preprint arXiv:2012.12556, 2020.
Menotti et al. [2023] Laura Menotti, Gianmaria Silvello, Manfredo Atzori, Svetla Boytcheva, Francesco Ciompi, Giorgio Maria Di Nunzio, Filippo Fraggetta, Fabio Giachelle, Ornella Irrera, Stefano Marchesin, et al. Modelling digital health data: The ExaMode ontology for computational pathology. Journal of Pathology Informatics, 14:100332, 2023.
Neumann et al. [2019] Mark Neumann, Daniel King, Iz Beltagy, and Waleed Ammar. ScispaCy: fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669, 2019.
Mikolov et al. [2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems, 26, 2013.
Marini et al. [2021b] Niccolò Marini, Sebastian Otálora, Damian Podareanu, Mart van Rijthoven, Jeroen van der Laak, Francesco Ciompi, Henning Müller, and Manfredo Atzori. Multi_scale_tools: a python library to exploit multi-scale whole slide images. Frontiers in Computer Science, 3:684521, 2021b.
Janowczyk et al. [2019] Andrew Janowczyk, Ren Zuo, Hannah Gilmore, Michael Feldman, and Anant Madabhushi. HistoQC: an open-source quality control tool for digital pathology slides. JCO clinical cancer informatics, 3:1–7, 2019.
Junczys-Dowmunt et al. [2018] Marcin Junczys-Dowmunt, Roman Grundkiewicz, Tomasz Dwojak, Hieu Hoang, Kenneth Heafield, Tom Neckermann, Frank Seide, Ulrich Germann, Alham Fikri Aji, Nikolay Bogoychev, et al. Marian: Fast neural machine translation in C++. arXiv preprint arXiv:1804.00344, 2018.
Chen et al. [2020] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.
Oquab et al. [2023] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
Buslaev et al. [2020] Alexander Buslaev, Vladimir I Iglovikov, Eugene Khvedchenya, Alex Parinov, Mikhail Druzhinin, and Alexandr A Kalinin. Albumentations: fast and flexible image augmentations. Information, 11(2):125, 2020.
Woolson [2007] Robert F Woolson. Wilcoxon signed-rank test. Wiley encyclopedia of clinical trials, pages 1–3, 2007.
Goode et al. [2013] Adam Goode, Benjamin Gilbert, Jan Harkes, Drazen Jukic, and Mahadev Satyanarayanan. OpenSlide: A vendor-neutral software foundation for digital pathology. Journal of Pathology Informatics, 4(1):27, 2013.