1 Introduction

Fig. 1

Concept: A synthetic dataset generator creates a dataset with a given difficulty D. A subset with N images is sampled from this dataset. On this subset, we train a DL model and CIPP and compare the performances of both approaches. The performance is measured using the F1-score. This allows us to estimate the “Break-Even-Point” (BEP), up to which the CIPP is still able to outperform DL. In the end, we aggregate the experimental results of each dataset to define the specifications for the usability of CIPPs over DL

Semantic segmentation is a crucial task in computer vision, widely used in fields like autonomous driving, medical tissue evaluation, and remote sensing image analysis. Deep learning (DL) methods, including convolutional neural networks (CNN) [1,2,3] and vision transformers (ViT) [4], have become the preferred approach to this type of problem due to their outstanding performance.

DL approaches are adaptive and can be applied to a wide range of tasks with little effort. Consequently, they have become the go-to solution for this type of problem, while conventional image processing techniques, such as Thresholding, Watershed, Active Contour, (Super) Pixel Classification and Handcrafted Features, are often overlooked. Nevertheless, automated and sophisticated conventional image processing pipelines (CIPPs) [5,6,7,8] are still available.

DL methods, however, have downsides as well. The training process for DL involves representation learning and requires a significant amount of computational resources. Although researchers are currently exploring interpretability and explainability in DL [9, 10], the available methods, such as class activation maps and gradient analysis, are only applicable to image classification.

In contrast, CIPP approaches excel in areas such as computational complexity, inference speed, and explainability. The decision process of a CIPP can be easily analyzed by executing and visualizing each step separately, as the CIPP consists of well-understood steps. CIPPs are particularly suitable if the problem at hand is easy to solve or an efficient and simple solution is needed [11]. An expert can inject implicit knowledge into a CIPP, reducing the amount of information that needs to be learned. Therefore, CIPPs can be successful when few data points or computational resources are available [12].

These properties of DL and CIPP show the potential of both approaches and their ability to complement each other when applied in the right scenario. The general consensus states that DL performs best on large and diverse datasets, while CIPPs are applied to small and easy datasets. Studies comparing DL and conventional image processing in the field of image classification [13,14,15,16,17,18] or semantic segmentation [17, 19,20,21,22,23,24,25,26,27] show that DL methods consistently exceed or at least match the performance of conventional techniques. All of these comparisons were performed on individual datasets without evaluating the underlying dataset properties. A neutral and systematic evaluation of the applicability of CIPPs and DL in relation to the properties of semantic segmentation datasets, together with guidelines for their application, is currently missing.

In this paper, we aim to address this gap by analyzing the strengths and weaknesses of DL models compared to CIPPs in terms of dataset properties. We introduce an automatically optimized conventional image processing pipeline, which is as easy to apply to a problem as a DL method, and provide a novel synthetic dataset generator that enables us to conduct experiments and investigate the behavior of DL and CIPP for various difficulties and different numbers of images. The benchmark dataset supports several tunable noise types to increase the difficulty. Additionally, we evaluate different dataset sizes with respect to the influence of stochastic and heterogeneous errors in training and testing. Finally, we provide guidelines for choosing the appropriate algorithm (CIPP/DL) based on the characteristics of the dataset and problem.

2 Concept

In this paper, we conduct a study on the performance of DL and CIPP approaches for semantic segmentation to discover effects that let CIPP perform better than DL. The concept of the study is shown in Fig. 1. We focus specifically on the impact of the amount of training data and the difficulty of the task.

Therefore, we introduce a synthetic dataset generator which enables us to quantify and isolate the properties of a semantic segmentation task. Synthetic data is generated with a clearly defined difficulty D. From each dataset, we randomly draw a number of images N and train a DL model and a CIPP on this subset. For each subset with difficulty D and number of images N, we can compare the performance of the DL model and the CIPP and determine the “Break-Even-Point” (BEP) for each dataset. We expect the CIPP to perform well on easy datasets when few training images are provided. To confirm this hypothesis, we aggregate the results over all datasets to specify the area of usability where a CIPP outperforms DL in relation to the number of training samples and the difficulty of the dataset.
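
To make the notion of the BEP concrete, the following minimal sketch determines it from aggregated results; the helper function and its inputs are hypothetical and only illustrate the definition used above.

```python
def break_even_point(f1_cipp, f1_dl, n_values):
    """Largest number of training images N at which the CIPP still outperforms DL.

    f1_cipp and f1_dl map N -> mean F1-score for one dataset difficulty D
    (hypothetical data structures, for illustration only).
    """
    better = [n for n in n_values if f1_cipp[n] > f1_dl[n]]
    return max(better) if better else None

# Made-up scores for illustration only (not experimental results):
n_values = [4, 8, 16, 32, 64, 128]
f1_cipp = {4: 0.80, 8: 0.81, 16: 0.82, 32: 0.82, 64: 0.83, 128: 0.83}
f1_dl = {4: 0.70, 8: 0.78, 16: 0.84, 32: 0.90, 64: 0.93, 128: 0.95}
print(break_even_point(f1_cipp, f1_dl, n_values))  # -> 8
```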

3 Synthetic dataset

To generate synthetic datasets for the comparison of semantic segmentation approaches, we model an image generation pipeline as depicted in Fig. 2. Each generated dataset contains \(\hat{N}\) unique images with \(\hat{N}_{\textrm{train}}=512\) in the train set and \(\hat{N}_{\textrm{test}}=512\) in the test set. An image \(I_i\) with \(i \in [1, \hat{N}]\) is a square of height and width \(s_{\textrm{img}}=400px\) with three RGB color channels and a corresponding binary label map \(L_i\) of the same size. In the images and their respective label maps, we place an elliptical object on top of heterogeneous structures that constitute our background. The object and background are slightly altered, e.g., by texture on the object and different background colors, to ensure a baseline difficulty for our segmentation task. Subsequently, different types of noise are added with a defined rate D to increase the difficulty further.

In detail, the images are generated as illustrated in Fig. 2 using the following steps:

Fig. 2

The data generation process: we start with an empty frame (top left) and create a background (Gaussian Blobs) on the frame, before the object (ellipse) to be identified is inserted on top. As specified by the user, three noise types (Blurring, Salt-n-Pepper, Color-Shift) are applied

Create background: The background is drawn first and covers the entire image \(I_i\) with the purpose of giving the segmentation problem a baseline difficulty. In this study, we used a background consisting of 50–200 randomly generated Gaussian distributions. The color of all the blobs in a single image \(I_i\) is randomly chosen from the candidates brown, purple, and teal, which all differ from the color of the object to identify (added in the next step).

Insert object: Then an elliptical object is placed at a random position in the image \(I_i\) and the respective label map \(L_i\). Here, we use a green ellipse that has a salt-n-pepper texture and varies slightly in shape, color, and degree of texture.

Apply noise: Noise is added last to an image \(I_i\) and applied to both the background and the object. The user defines the noise difficulty \(D_{\textrm{Noise}} \in [0\%, 100\%]\) which determines the diversity and maximum strength of the applied noise for the entire data set. The exact degree of noise applied to an individual image \(I_i\) is defined by the noise parameter \(g_{{\textrm{Noise}}, i}\) that is sampled from an interval \(G_{\textrm{Noise}}\) as shown in Fig. 3. To be precise, the noise parameter \(g_{{\textrm{Noise}}, i}\) for an image \(I_i\) is sampled uniformly from the interval \(G_{\textrm{Noise}}\), which is defined as follows:

$$\begin{aligned} G_{\textrm{Noise}} = [g^{\min }_{\textrm{Noise}}, D_{\textrm{Noise}} \cdot g^{\max }_{\textrm{Noise}}]. \end{aligned}$$
(1)

The lower limit of the interval \(G_{\textrm{Noise}}\) is defined by the minimum possible noise parameter \(g^{\min }_{\textrm{Noise}}\) and the upper limit is defined by the maximum possible noise parameter \(g^{\max }_{\textrm{Noise}}\) scaled with the defined difficulty \(D_{\textrm{Noise}}\).
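
As a minimal illustration of Eq. (1), the per-image noise parameter could be drawn as follows; the helper function is hypothetical and not part of the published generator.

```python
import numpy as np

def sample_noise_parameter(d_noise, g_min, g_max, rng):
    """Draw g_Noise,i uniformly from the interval [g_min, d_noise * g_max] (Eq. 1)."""
    return rng.uniform(g_min, d_noise * g_max)

# Example: blurring kernel size for one image at D_Noise = 20% (g_min = 0, g_max = 400)
rng = np.random.default_rng(0)
g_bl_i = sample_noise_parameter(0.20, g_min=0, g_max=400, rng=rng)
```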

Fig. 3

The noise difficulty \(D_{\textrm{Noise}}\) is set by the user for the whole dataset and defines the upper limit of the interval \(G_{\textrm{Noise}}\). The noise parameter \(g_{{\textrm{Noise}}, i}\) is then uniformly sampled from the interval \(G_{\textrm{Noise}}\) and applied to the image \(I_i\). This is repeated for all images in the dataset

This sampling process ensures that the noise difficulty \(D_{\textrm{Noise}}\) defines the diversity of noise and the maximum amount of noise applied. The concept of applying a varying degree of noise to every generated image is inspired by real-world applications where some samples are easier to identify, while others are noisier. \(D_{\textrm{Noise}}=0\%\) means that no additional noise is added to a dataset, but the properties of the object, as well as the background, still differ between images, which constitutes a baseline difficulty for our synthetic dataset. By increasing the difficulty of noise \(D_{\textrm{Noise}}\), a larger interval \(G_{\textrm{Noise}}\) of noise parameters is covered, thus raising the overall level and diversity of noise in a dataset. The specific noise options are the following:

  • Blurring: A normalized box filter is applied to the image, thus blurring the object to identify. The noise parameter corresponds to the kernel size, with \(g^{\min }_{\textrm{BL}} = 0\) and \(g^{\max }_{\textrm{BL}} = 400px\), the maximum image side length \(s_{\textrm{img}}\).

  • Salt-n-pepper: For each pixel, a random value is generated, which is added to or subtracted from the original pixel value. The noise parameter limits the maximum pixel value that can be generated, with \(g^{\min }_{\textrm{SNP}} = 0\) and \(g_{\textrm{SNP}}^{\max } = 255\).

  • Color-shift: For each channel, a random value is generated which is added to or subtracted from the original channel value. The noise parameter corresponds to the value added, with \(g_{\textrm{CS}}^{\min } = 0\) and \(g_{\textrm{CS}}^{\max }= 255\).

In real-world applications, the three types of noise are influenced by various properties of the recording device, such as the employed optics or the resolution of the detector, and are therefore not directly related to each other. Consequently, a general parameter D describing the degree of noise in a dataset can be calculated as the mean of the individual difficulties:

$$\begin{aligned} D = \frac{1}{3} (D_{\textrm{BL}} + D_{\textrm{SNP}} + D_{\textrm{CS}}). \end{aligned}$$
(2)

To simplify matters, we generate our synthetic dataset using equal noise levels for all types, e.g., \(D = 5\% = D_{\textrm{BL}} = D_{\textrm{SNP}} = D_{\textrm{CS}}\).
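
To illustrate the three noise types listed above and the equal per-type difficulties, the following sketch applies blurring, salt-n-pepper, and color-shift to a single image; it assumes OpenCV and NumPy, and the exact parameter handling of the published generator may differ.

```python
import cv2
import numpy as np

def apply_noise(image, d_noise, rng):
    """Apply blurring, salt-n-pepper, and color-shift with equal difficulty d_noise."""
    img = image.astype(np.float32)

    # Blurring: normalized box filter, kernel size sampled from [0, d_noise * 400]
    k = int(rng.uniform(0, d_noise * 400))
    if k > 1:
        img = cv2.blur(img, (k, k))

    # Salt-n-pepper: per-pixel value added or subtracted, magnitude limited by d_noise * 255
    g_snp = rng.uniform(0, d_noise * 255)
    img += rng.uniform(-g_snp, g_snp, size=img.shape[:2])[..., None]

    # Color-shift: one random offset per channel, magnitude limited by d_noise * 255
    g_cs = rng.uniform(0, d_noise * 255)
    img += rng.uniform(-g_cs, g_cs, size=3)

    return np.clip(img, 0, 255).astype(np.uint8)

rng = np.random.default_rng(42)
noisy = apply_noise(np.zeros((400, 400, 3), np.uint8), d_noise=0.05, rng=rng)
```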

Fig. 4

Five randomly generated images of the baseline dataset with an overall difficulty \(D=0\%\) (baseline). The difficulty D is then increased by applying noise

In conclusion, the generation pipeline produces pairs of RGB images and binary label maps with elliptical objects for the purpose of semantic segmentation. The elliptical objects exhibit a textured surface and vary slightly, but differ from the blurred background in their sharp edges and color. Figure 4 presents examples of a dataset with \(D=0\%\), the baseline difficulty. By increasing the level of noise, the edges of the objects are blurred, texture is added across the entire image, and the colors of the whole image are shifted, complicating the segmentation task. The code to create synthetic datasets can be found here: https://github.com/FMuenke/synthetic-dummy-dataset.
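
For readers who prefer code over prose, the following compressed sketch mirrors the generation steps described above (background blobs, ellipse, label map) using plain OpenCV drawing primitives; the concrete colors, size ranges, and blob shapes are assumptions, and the linked repository contains the actual implementation.

```python
import cv2
import numpy as np

def generate_sample(rng, s_img=400):
    """Sketch of one (image, label) pair: blob background plus a green ellipse."""
    image = np.zeros((s_img, s_img, 3), np.uint8)
    label = np.zeros((s_img, s_img), np.uint8)

    # Background: 50-200 blurred blobs in brown, purple, or teal
    # (the concrete BGR values are assumptions, not the generator's exact colors)
    palette = [(60, 75, 120), (120, 50, 110), (130, 130, 30)]
    color = palette[int(rng.integers(len(palette)))]
    for _ in range(int(rng.integers(50, 201))):
        center = (int(rng.integers(0, s_img)), int(rng.integers(0, s_img)))
        cv2.circle(image, center, int(rng.integers(5, 60)), color, -1)
    image = cv2.GaussianBlur(image, (0, 0), sigmaX=5)

    # Object: green ellipse at a random position, drawn into image and label map
    center = (int(rng.integers(80, s_img - 80)), int(rng.integers(80, s_img - 80)))
    axes = (int(rng.integers(30, 80)), int(rng.integers(30, 80)))
    angle = float(rng.uniform(0, 180))
    cv2.ellipse(image, center, axes, angle, 0, 360, (60, 200, 60), -1)
    cv2.ellipse(label, center, axes, angle, 0, 360, 1, -1)
    return image, label

img, lbl = generate_sample(np.random.default_rng(0))
```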

4 Semantic segmentation models

Fig. 5

Optimization process of a CIPP. The order of operations is set by the user and each operation has a predefined set of parameters associated with it. During the optimization process, the optimal parameters are determined based on the provided training data by grid-search. All provided images are processed with all possible parameter combinations and finally the set of parameters with the highest F1-score on the training data is picked

Conventional image processing relies on simple operations such as thresholding, edge-detection, or morphological operations, where each operation can be specified with individual parameters. We define a CIPP model as a static sequence of conventional image processing operations. As depicted in Fig. 5, our implementation of a CIPP model provides a framework for an expert to stack these operations without manually setting parameters. Each operation has a pre-defined set of parameters. In this paper, we select the best parameters by running all available training images through all possible combinations of parameterized pipelines (grid-search) and selecting the set of parameters with the best performance on the training data. Besides grid-search, our framework also provides other optimization strategies such as random search or a genetic algorithm.
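
The grid-search step can be sketched as follows; the pipeline interface and parameter grid are hypothetical and do not reflect the actual API of the cipp package.

```python
from itertools import product

import numpy as np

def f1_score(pred, target):
    """F1-score of a binary prediction mask against a binary label map."""
    tp = np.sum((pred == 1) & (target == 1))
    fp = np.sum((pred == 1) & (target == 0))
    fn = np.sum((pred == 0) & (target == 1))
    return 2 * tp / max(2 * tp + fp + fn, 1)

def grid_search(pipeline, param_grid, images, labels):
    """Run every parameter combination on the training data and keep the best one."""
    best_score, best_params = -1.0, None
    keys = list(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        scores = [f1_score(pipeline(img, **params), lbl)
                  for img, lbl in zip(images, labels)]
        if np.mean(scores) > best_score:
            best_score, best_params = float(np.mean(scores)), params
    return best_params, best_score
```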

The CIPP model is specifically designed to use simple techniques to ensure intuitive application to a problem, explainable results, and fast inference even when few data points and computational resources are available. However, these strengths of a CIPP only come into play when it is as easy to apply to a problem as DL. Thus, we have created an easily installable Python package to enable the simple use of CIPPs.

The CIPP is designed to solve the synthetic dataset presented in Sect. 3. The segmentation target features two distinct attributes, a salt-n-pepper texture and a bright green color, which are detectable with edge-detection and thresholding. The CIPP used is visualized in Fig. 6. We aim to increase the processing speed by reducing the image size to 200px \(\times \) 200px and only applying the CIPP to the green channel. Afterward, the CIPP has the option to apply blurring of different scales to the image to remove noise. The subsequent inversion operation enables the CIPP to choose whether the image values should be inverted, swapping maximum and minimum. Segmentation is performed by applying Thresholding, Otsu-Thresholding [28] or Edge-Detection. The segmentation mask is post-processed by applying Closing and Eroding. Further details on the image processing operations are found in Tab. 1. The implementation of the CIPP can be found here: https://github.com/FMuenke/cipp
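
Expressed as a single OpenCV function, one parameterization of the pipeline in Fig. 6 could look roughly like the sketch below; the parameter names and ranges are illustrative assumptions and differ from the cipp package internals.

```python
import cv2
import numpy as np

def cipp_segment(image, blur_ksize=3, invert=False, method="otsu",
                 threshold=128, closing_ksize=5, eroding_ksize=3):
    """One parameterization of the CIPP from Fig. 6 (illustrative only)."""
    # Static pre-processing: resize and select the green channel
    small = cv2.resize(image, (200, 200))
    green = small[:, :, 1]

    # Dynamic pre-processing: optional blurring and inversion
    if blur_ksize > 1:
        green = cv2.blur(green, (blur_ksize, blur_ksize))
    if invert:
        green = 255 - green

    # Segmentation: thresholding, Otsu-thresholding, or edge-detection
    if method == "otsu":
        _, mask = cv2.threshold(green, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    elif method == "threshold":
        _, mask = cv2.threshold(green, threshold, 255, cv2.THRESH_BINARY)
    else:  # "edges"
        mask = cv2.Canny(green, 50, 150)

    # Post-processing: closing and eroding
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, np.ones((closing_ksize,) * 2, np.uint8))
    mask = cv2.erode(mask, np.ones((eroding_ksize,) * 2, np.uint8))
    return (mask > 0).astype(np.uint8)
```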

In the domain of image segmentation, the U-Net [1] is a prominently used neural network model [29,30,31,32] that we employ as our representative for DL. We use the implementation from [33]. The hyperparameters for training the U-Net were determined through a brief random search to fit the synthetic dataset. The final parameters are the following (a configuration sketch based on these settings is shown after the list):

  • Input size: 256 \(\times \) 256,

  • Backbone: ResNet18 [34],

  • Loss: Dice,

  • Optimizer: Adam, Learning rate: \(10^{-5}\),

  • Early Stopping after 100 Epochs without improving the validation loss,

  • Learning Rate Scheduling (factor 0.5 after 50 epochs),

  • Augmentations: horizontal/vertical flip, rotation, cropping.
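
As a rough orientation, the listed settings translate into a Keras-style configuration such as the one below; this sketch assumes the segmentation_models package and omits details such as the exact augmentation and data-loading setup.

```python
import segmentation_models as sm
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.optimizers import Adam

# U-Net-R18: randomly initialized encoder; U-Net-R18-I: encoder_weights="imagenet"
model = sm.Unet("resnet18", input_shape=(256, 256, 3),
                classes=1, activation="sigmoid", encoder_weights=None)
model.compile(optimizer=Adam(learning_rate=1e-5), loss=sm.losses.DiceLoss())

callbacks = [
    # Stop after 100 epochs without improvement and keep the best validation model
    EarlyStopping(monitor="val_loss", patience=100, restore_best_weights=True),
    # Halve the learning rate after 50 epochs without improvement
    ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=50),
]
# model.fit(train_images, train_labels, batch_size=8,
#           validation_data=(val_images, val_labels), callbacks=callbacks)
```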

Fig. 6

Structure of the CIPP: An image \(I_i\) is resized to 200px \(\times \) 200px and the green channel is selected for further processing. Afterward Blurring and Inversion are used as pre-processing steps. The target is segmented by applying Thresholding, Otsu-Thresholding or Edge-Detection to the image. Finally Closing and Eroding are used to post-process the output. (Static operations are defined by the user and do not contain variable parameters. Dynamic operations have variable parameters which are optimized during the training process)

During the random search, it became evident that a batch size of 8 significantly improved performance (+30% F1-score) compared to a batch size of 1. When training with fewer than 8 images, the batch size is set to the number of available training images; otherwise, a batch size of 8 is used.

During training, the only augmentation techniques used are horizontal/vertical flips, rotation, and cropping, since the synthetic dataset already uses salt-n-pepper noise, blurring, and color-shift to increase difficulty. These augmentation techniques are not useful for a CIPP model and thus are not used during its training.

For each set of N training images, we select the same number of additional validation images. These images are used to evaluate the performance of the DL model during training. As the final DL model, we choose the best-performing model on the validation dataset. Transfer learning utilizing pretrained weights is a common strategy to improve data efficiency. Thus, we consider the baseline U-Net as described (U-Net-R18) and the same U-Net with an encoder pretrained on ImageNet [35] (U-Net-R18-I) in our experiments.

5 Results

5.1 Overview

Fig. 7

Example Images from our dataset for the difficulties \(D = 5\%\), \(20\%\) and \(50\%\), as introduced in Sect. 3

We train three types of models in our experiments as introduced in Sect. 4. Each model is trained on a synthetic dataset, which covers all types of noise (blurring, salt-n-pepper, and color-shift) simultaneously. This dataset increases its difficulty D by raising the separate noise difficulties \(D_{\textrm{BL}}\), \(D_{\textrm{CS}}\), and \(D_{\textrm{SNP}}\) equally, as shown in Fig. 7. The difficulties \(0\%\) to \(50\%\) in steps of \(5\%\) and additionally \(100\%\) are evaluated.

We train with different numbers of training images \(N=\{4, 8, 16, 32, 64, 128\}\) for each difficulty D. N corresponds only to the number of images used to train. Since the U-Net models require validation data to determine the optimal time to stop training, we always supply the U-Nets with an equal number of validation images in addition to the N training images. We test CIPP and DL on each subset and compare their F1-score on the full test set. Since the images are selected randomly, we repeat each training 20 times to reduce the random deviation introduced by the initialization of the U-Net and the choice of training images. During the sampling of images, we ensure that both approaches are trained on the same images by setting the random seed (e.g., the first iteration of the CIPP is trained on identical images as the first iteration of the U-Net models). The difficulty, as described in Sect. 3, represents the strength of applied noise as well as the diversity of the dataset.
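
Schematically, the protocol for a single difficulty D can be written as the following loop; the train and evaluate helpers are hypothetical placeholders for the CIPP optimization and U-Net training described in Sect. 4.

```python
import numpy as np

N_VALUES = [4, 8, 16, 32, 64, 128]
N_REPEATS = 20

def run_experiments(dataset, train_cipp, train_unet, evaluate_f1):
    """Compare CIPP and U-Net on identical random subsets of one dataset (one difficulty D)."""
    results = {}
    for n in N_VALUES:
        for repeat in range(N_REPEATS):
            # The same seed per repetition guarantees identical subsets for both approaches
            rng = np.random.default_rng(repeat)
            idx = rng.choice(len(dataset.train_images), size=n, replace=False)
            train_imgs = [dataset.train_images[i] for i in idx]
            train_lbls = [dataset.train_labels[i] for i in idx]

            cipp = train_cipp(train_imgs, train_lbls)
            unet = train_unet(train_imgs, train_lbls)  # draws n validation images internally

            results[(n, repeat)] = {
                "cipp": evaluate_f1(cipp, dataset.test_images, dataset.test_labels),
                "unet": evaluate_f1(unet, dataset.test_images, dataset.test_labels),
            }
    return results
```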

Fig. 8

The performance matrix of the U-Nets and the CIPP for all difficulties D and the number of training images N. The performance is measured using the average F1-score

5.2 Baseline U-Net

The average results of the U-Net-R18 on our dataset are displayed in Fig. 8. The U-Net-R18 performs well on the difficulties \(D \le 5\%\) regardless of the number of training images, with an F1-score above 93%. As expected, the performance decreases with increasing difficulty and improves with a larger number of training images. The U-Net-R18 still reaches a performance above 73% even for the higher difficulties up to \(D \le 50\%\), provided enough training images are available. Only at the difficulty \(D=100\%\) is the U-Net-R18 unable to learn adequate filters for the segmentation task, failing to exceed a 6% F1-score.

5.3 Pretrained U-Net

Figure 8 presents the average results of U-Net-R18-I, which was pretrained on ImageNet. Our findings indicate that the performance of U-Net-R18-I is closely aligned with that of U-Net-R18. Specifically, as the level of difficulty increases, the performance of both models decreases, while increasing the number of training images improves their performance. However, we observed that U-Net-R18-I performs better than U-Net-R18 by an average of 0.56% across all combinations of difficulties and training images. Notably, the performance gap between the two models is generally below 9%, and the majority of differences larger than 3% occur when the number of training images is less than 16. Our experiments also demonstrate that the effect of pretrained weights on model performance in this scenario is negligible. We assume that this could be attributed to the fact that the pretrained weights available are not specifically tailored to the domain they are being applied to.

5.4 CIPP

We assess the performance of CIPP and present the results in Fig. 8. Unlike the U-Nets, the CIPP is less sensitive to the number of training images. We observed that increasing the number of training images from \(N=16\) to \(N=128\) leads to a maximum improvement of 7% for all difficulty levels. Notably, the performance gain is more pronounced when the number of training images is increased from \(N=4\) to \(N=16\), with an average improvement of around 11%. Although the CIPP’s performance decreases as difficulty increases, it still maintains a relatively high performance level of 26% even at the highest difficulty level of 100%.

5.5 Comparison

Fig. 9

Exemplary results for different difficulties over the number of training images N

We conducted a side-by-side comparison of the three models, evaluating their performance at three different difficulty levels, as shown in Fig. 9. Rather than presenting only the average performance, we provide the results of all 20 experimental runs, which enables us to observe the variation in performance for different numbers of training images N. The results indicate that the variation decreases as the number of training images N increases for all models. Moreover, we observed that the deviation between separate runs increases clearly as the difficulty level of the dataset increases for both U-Nets.

Fig. 10

Comparison of the U-Net-R18 and the U-Net-R18-I with the CIPP. The differences of the performance matrices between the CIPP and the U-Nets are visualized. (Positive values: U-Nets outperform; negative values: CIPP outperforms)

In Fig. 10, we compare the average performances of the three models by subtracting the performance matrix of the CIPP from those of the U-Nets. This yields a matrix that highlights the differences between the U-Nets and the CIPP. A positive value indicates superior performance by the U-Nets, while a negative value indicates superior performance by the CIPP. We observed that both matrices are similar, as the performances of the U-Nets are comparable. The CIPP outperforms the U-Nets at \(N=4\) and \(D=25\%\). With increasing difficulty, all models exhibit a drop in performance, but the CIPP maintains a more stable performance. Further, the CIPP is able to outperform the U-Nets at \(D=50\%\) even for \(N=32\) training images. At the highest difficulty level of 100%, the CIPP performs better across all numbers of training images.

Overall, the CIPP exhibits a more stable and consistent performance than the U-Nets, and is less affected by changes in dataset difficulty and the number of training images. Additionally, the spread of the results from the 20 distinct test runs is more stable for the CIPP than for the U-Nets at higher difficulty levels, as seen in Fig. 9. It is worth noting that the U-Nets exhibit outstanding performance for a small number of training images, particularly for difficulties \(D \le 15\%\). Our suspicion is that the U-Nets are capable of fitting the provided data due to the limited diversity of the dataset and the fact that the validation images closely resemble the general dataset.

When comparing the inference speed of DL and CIPP on a MacBook with a 2.3 GHz Quad-Core Intel Core i7 processor, the DL approach is able to process 2.36 images per second compared to 62.1 images per second for the CIPP. This makes the CIPP especially relevant for devices with low computational capacities, such as microcontrollers.

5.6 Transferability

The results presented in this paper are derived from synthetic data, raising the question of whether these findings can be extrapolated to real-world datasets. In the domain of biomedical image processing, datasets often show high diversity within and between datasets, and are typically limited in size. Our research suggests that datasets sharing similar inherent features yield results comparable to those obtained from our synthetic dataset. We have evaluated the effectiveness of the CIPP on four real-world datasets in Appendix 1. The LIVECell dataset [36] and the DOORS dataset [37] exhibit significant diversity between their training and testing subsets. This diversity leads to the anticipated superiority of the CIPP over the U-Net-R18-I. In the case of the Derma ISIC dataset [38], both models demonstrate comparable performance, owing to the dataset’s relatively limited diversity. Conversely, on the CryoNuSeg dataset [39], the CIPP exhibits a comparatively inferior performance due to the limited diversity among segmentation targets.

6 Conclusions

So far, there has been no comprehensive study comparing conventional image processing to modern deep learning algorithms with respect to dataset-specific properties. Thus, we introduced a synthetic dataset with tunable degrees of difficulty and conducted an exhaustive study on DL approaches and our own easy-to-apply implementation of a CIPP. The dataset serves as a versatile benchmark and will be used for future studies as well. Furthermore, it can be used to educate students and researchers in understanding and comparing the performance of semantic segmentation approaches.

Our findings show that DL performs best on tasks with low difficulty/diversity and large amounts of training data. Deep learning is able to consider context and shapes, which makes it effective in recognizing the target even with few training images. However, if only a few training images are provided, the diversity of the dataset is not properly represented, leading to decreased DL performance. In such cases, the CIPP is able to generalize better due to human expert input and the limited parameter space to optimize.

Overall, we recommend the use of our implementation of a CIPP in all scenarios due to its ease of application and low resource requirements. Our proposed CIPP implementation can work with the same data format as most DL frameworks, reducing the additional effort required for adoption. Additionally, CIPPs allow for easy understanding and adaptation of the processing pipeline to new data, making them useful in laboratory settings with few experimental modalities that require quick adaptation with minimal computational costs. Finally, the CIPP can also be used to post-process the outputs of DL approaches by removing artifacts, or to support the labeling process by quickly providing label maps which can be corrected by a human operator.

Our study highlights the importance of understanding the strengths and weaknesses of both deep learning methods and conventional image processing pipelines. Researchers and practitioners can use this knowledge to choose the most appropriate approach for their specific task and dataset, based on the available resources and desired performance metrics.

In our future research, we plan to expand the capabilities of our CIPP implementation and assess its ability to assist human annotators in fast and efficient pre-labeling. Specifically, we aim to enhance our CIPP with additional image processing techniques and optimize its performance on various types of image datasets. Additionally, we will investigate the potential of our CIPP to be used in combination with DL methods to further improve semantic image segmentation accuracy. We will also explore the possibility of integrating our CIPP into existing annotation tools to facilitate the labeling process for human annotators.