1 Introduction

Celiac disease (CD) [1] is a common autoimmune disorder primarily affecting the small bowel. It is characterized by an inflammation of the mucosa of the duodenum and the pars descendens. During the course of the disease, the mucosa loses its absorptive villi and hyperplasia of the enteric crypts occurs, which leads to a diminished ability to absorb nutrients (Fig. 1). In the past, a significant amount of research has been performed in the field of computer aided CD diagnosis from endoscopic imaging data [2,3,4,5,6], with the aim of providing a second opinion besides histological assessment of biopsies and/or of reducing the number of required biopsies [7]. In early literature, the focus in this field was mainly on handcrafted image representations, including methods such as local binary patterns and mid-level representations such as Fisher vectors [3, 7]. Recently, convolutional neural networks (CNNs) have exhibited superior performance, outperforming these earlier methods [4, 6].

Fig. 1.

Informative patches clearly showing markers for distinguishing between healthy (a) and diseased (b) subjects. In the case of mild impairments, however, the discrimination is more difficult.

Fig. 2.

Original endoscopic images showing several kinds of degradation (such as blur, low contrast, bubbles and overexposure) partly or completely hiding the distinctive disease markers.

Most work in computer aided CD diagnosis [3, 4, 6, 7] relies on ‘informative’ patches only, which were extracted by medical experts. These data are utilized not only for training the classification model, but also for evaluation. For a practical clinical system, however, this means that the physician needs to select a reliable patch before the automated classification approach can be applied. This requires manual effort, and the obtained decision is no longer fully observer-independent.

A holistic (non-patch-based) approach cannot be applied effectively with limited training data, since disease markers are often visible only locally, due to degradations (Fig. 2) and the patchy distribution of the disease markers [7].

A further issue is that degraded patches showing no sensible information are generally very similar (in feature space) to patches exhibiting diseased mucosa. The straightforward application of patch-based approaches to randomly selected patches therefore leads to low as well as imbalanced sensitivity and specificity [8]. In previous work, it was suggested to select (and merge) discriminative patches from original images by computing a weighted linear combination of several basic quality measures to estimate a patch’s quality [8]. These consist of metrics measuring (1) the average illumination, (2) the focus level, (3) reflections, (4) noise and (5) contrast (a sketch of such a score is given below). This improved the classification rates compared to a random patch selection. Nevertheless, a linear combination of five basic measures is unlikely to perfectly represent a patch’s distinctiveness and is furthermore sensitive to changing imaging settings (e.g. when applying the narrow-band imaging technique [9]).
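As a rough illustration of such a quality-metric-based score, the following minimal Python sketch combines five basic measures in a weighted linear fashion. The concrete metric definitions and the weights here are our own assumptions and do not reproduce the reference implementation of [8].

```python
import numpy as np

def patch_quality(patch, weights=(0.2, 0.3, 0.2, 0.15, 0.15)):
    """Illustrative quality score: a weighted linear combination of
    five basic measures (cf. [8]); definitions and weights are
    assumptions for this sketch, not the reference implementation."""
    gray = patch.astype(float).mean(axis=2)            # grayscale intensity
    illumination = gray.mean() / 255.0                 # (1) average illumination
    gy, gx = np.gradient(gray)
    focus = np.hypot(gx, gy).mean() / 255.0            # (2) focus via gradient energy
    reflections = 1.0 - (gray > 240).mean()            # (3) penalize specular highlights
    noise = 1.0 - np.abs(gray - np.median(gray)).mean() / 255.0  # (4) crude noise proxy
    contrast = min(gray.std() / 128.0, 1.0)            # (5) contrast
    return float(np.dot(weights, (illumination, focus, reflections, noise, contrast)))
```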

This work is partly motivated by an approach for histological cancer subtyping [10] which relies on a patch-wise classification followed by an aggregation of the patch-wise decisions. That is, for each image, the final representation consists of a histogram collecting the classified subtype occurrences, which determines the final image-wise subtype. This approach, however, demands more than two classes: in the case of dichotomization (considered in this work), the aggregation would reduce to majority voting, which is not effective for CD diagnosis [8]. The utilization of multiple instance learning [11] is inhibited by the rather small number of patients available for training.

In this work, the focus is first on learning a discriminative classification model based on CNNs, utilizing informative (manually extracted) patches only. The aim of this stage is to obtain a model exhibiting a high discriminative power whenever distinctive information is available in the image. Then, based on automatically extracted image data, we estimate the probability of a correct patch-wise classification by means of a probabilistic model. Making use of ‘real-world’ training data partly containing non-discriminative information, the target of this stage is to obtain a reliable confidence estimation. Consequently, for an image to be evaluated, we obtain a decision as well as a confidence level for several patch positions. In order to determine an image-wise decision, the patch-based probabilities of each image are aggregated and a further classifier is trained to determine the final class of an original image. The training data utilized for fitting the final classifier can be increased without any manual effort as novel ground-truth labeled data is obtained during routine endoscopy (i.e. in contrast to [8], no further manual selection of informative patches is required).

2 Proposed Pipeline

The proposed pipeline (Fig. 3) consists of training a discriminative model (1) based on manually extracted informative patch data (Sect. 2.1), followed by fitting a probabilistic model (2) relying on real-world data (Sect. 2.2). The probabilistic patch-wise outputs are aggregated (3), a final classification model is trained (4), evaluation is performed per image (5) and finally a patient-wise classification (6) is proposed (Sects. 2.3–2.4).

Fig. 3.

Overview of the proposed classification pipeline.

The required training data consists of a data set containing ground-truth labeled informative patches (IP), which were extracted by medical experts and show distinct markers for diagnosis, as well as a data set containing ground-truth labeled original endoscopic images showing typical degradations that leave the disease markers partly invisible. From this data set, for each image, a large number of not necessarily informative patches (AP) are automatically extracted at predefined positions in an overlapping manner. The required processing steps are explained in detail in the following subsections:

2.1 Discriminative Patch-Based Model

First, the IP data set is utilized to train a discriminative classification model to distinguish between the two classes (CD, non-CD). To this end, we train a linear support vector machine (SVM) based on features extracted from CNNs, a combination which yielded exceptional performance in previous work [4]. The architectures as well as the training procedure of the utilized CNNs are described in detail in Sect. 2.5.
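As an illustration of this stage, the following minimal sketch extracts 4096-dimensional activations from a pretrained network and trains a linear SVM on them. Since VGG-f is not shipped with current frameworks, torchvision's VGG16 serves as a stand-in here; `ip_patches` and `ip_labels` are assumed placeholders for the IP data set.

```python
import torch
from torchvision import models
from sklearn.svm import LinearSVC

# Stage (1) sketch: 4096-d CNN activations classified by a linear SVM.
# VGG16 stands in for the VGG-f network of [13]; ip_patches is assumed
# to be an N x 3 x 224 x 224 tensor of informative patches.
cnn = models.vgg16(weights="IMAGENET1K_V1")
cnn.classifier = cnn.classifier[:-1]     # keep the 4096-d penultimate activations
cnn.eval()

with torch.no_grad():
    features = cnn(ip_patches).numpy()   # one 4096-d feature vector per patch

svm = LinearSVC(C=1.0)                   # C is selected by inner cross validation
svm.fit(features, ip_labels)
```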

2.2 Probabilistic Patch-Based Model

The discriminative model obtained in this way is subsequently applied to classify the AP data set. As the AP data set contains patches which distinctly differ from the IP data set (e.g. due to blur, low contrast, bubbles and overexposure, as shown in Fig. 2), the achieved patch-wise accuracies are expected to be significantly lower [8]. Additionally, an imbalance between specificity and sensitivity is expected, since unreliable data is on average more similar to class \(C_1\) (CD) than to class \(C_0\) (non-CD) [7].

In the next step, based on the AP data set, the probabilities of a correct classification are estimated. For this purpose, we reuse the discriminative model (CNN feature extraction followed by SVM classification) and apply it to patches of the AP data set, which yields a classification outcome (\(C_0\) or \(C_1\)) for each patch as well as its signed distance to the decision boundary. In this one-dimensional space of signed distances, we fit a flexible density model, specifically a Gaussian mixture model (GMM), to estimate the distributions of correctly classified (\(d_c\)) and incorrectly classified (\(d_i\)) samples. The probability of a correct classification for a sample \(x\) is then given by \(p_c(x) = \frac{d_c(x)}{d_c(x) + d_i(x)}\). This is performed individually for samples classified as \(C_0\) and \(C_1\), obtaining \(p_{c|C_0}(x)\) and \(p_{c|C_1}(x)\).
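A minimal sketch of this confidence estimation is given below, assuming the signed distances of the AP training patches (from the SVM's decision function) and a boolean mask of correct classifications are available; variable names are our own, and the model is fitted separately for patches predicted as \(C_0\) and \(C_1\).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_confidence_model(dist, correct, n_components=64):
    """Fit one GMM to the signed distances of correctly classified
    samples and one to those of incorrectly classified samples."""
    gmm_c = GaussianMixture(n_components).fit(dist[correct].reshape(-1, 1))
    gmm_i = GaussianMixture(n_components).fit(dist[~correct].reshape(-1, 1))
    return gmm_c, gmm_i

def p_correct(gmm_c, gmm_i, x):
    """p_c(x) = d_c(x) / (d_c(x) + d_i(x)) with GMM density estimates."""
    d_c = np.exp(gmm_c.score_samples(x.reshape(-1, 1)))
    d_i = np.exp(gmm_i.score_samples(x.reshape(-1, 1)))
    return d_c / (d_c + d_i)
```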

The GMM is preferred to established methods such as Platt scaling and isotonic regression [12] because the underlying distribution (\(p_c\)) is completely unknown and does not necessarily show monotonic behavior. This is due to the difference in feature distribution between the IP and the AP data set and the similarity between samples of class \(C_1\) and non-discriminative patches [8].

2.3 Image-Wise Classification

So far, we have trained a discriminative as well as a probabilistic model based on informative (IP) and automatically extracted (AP) patch data. In order to obtain one final decision for an image to be evaluated, patches are automatically extracted. For each patch, the classification outcome as well as the probabilistic outcome is estimated. Finally, all outputs for one image are merged by building a histogram of the probabilistic outputs (\(p_{c|C_0}\) and \(p_{c|C_1}\), respectively) for each of the classes \(C_0\) and \(C_1\). These two histograms are concatenated and a further classification model is trained to distinguish between the classes on image level. Training is again conducted on the AP data set. As model, a k-nearest neighbor (kNN) classifier is utilized in combination with the histogram intersection as distance measure, which is common practice in histogram classification.
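The aggregation and the kNN classification might be sketched as follows; the eight-bin histograms follow Sect. 3.2, while the k value and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def image_descriptor(p_c0, p_c1, bins=8):
    """Concatenate one confidence histogram per predicted class.
    p_c0 / p_c1 hold the confidences of the patches of one image
    that were classified as C0 / C1, respectively."""
    h0, _ = np.histogram(p_c0, bins=bins, range=(0.0, 1.0))
    h1, _ = np.histogram(p_c1, bins=bins, range=(0.0, 1.0))
    return np.concatenate([h0, h1]).astype(float)

def intersection_distance(a, b):
    # histogram intersection similarity turned into a distance
    return 1.0 - np.minimum(a, b).sum() / max(a.sum(), b.sum())

knn = KNeighborsClassifier(n_neighbors=11, metric=intersection_distance)
knn.fit(train_histograms, train_image_labels)   # assumed AP training half
```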

2.4 Patient-Wise Classification

The image-wise evaluation is performed by classifying a histogram obtained from the patches of one endoscopic image. In order to obtain a patient-wise decision (i.e. a set of images is considered), the kNN output of all images of one patient is interpreted in a probabilistic way by considering the distribution of the nearest neighbors’ class labels. The final class label is obtained by selecting the image with the highest confidence (i.e. the highest agreement among the nearest neighbors). The advantages of this approach compared to simply collecting the outcomes of the patches of each image per patient into one histogram are given (a) by the higher number of training samples (the number of images is significantly larger than the number of patients) and (b) by the fact that images which could not be clearly assigned to a class (e.g. non-informative images) are not considered during evaluation.
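A minimal sketch of this patient-wise fusion, assuming the class labels of the k nearest neighbors are available for each image of a patient (variable names are our own):

```python
import numpy as np

def patient_decision(neighbor_labels_per_image):
    """Pick the class of the most confident image of a patient.
    neighbor_labels_per_image: list of arrays, each holding the class
    labels (0/1) of the k nearest neighbors of one image."""
    best_conf, best_label = -1.0, None
    for labels in neighbor_labels_per_image:
        votes = np.bincount(labels, minlength=2)
        conf = votes.max() / votes.sum()      # agreement among the neighbors
        if conf > best_conf:
            best_conf, best_label = conf, int(votes.argmax())
    return best_label, best_conf
```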

2.5 Image Representations

For image representation, three CNN approaches are evaluated. Two of them exhibited excellent performances in previous work on the classification of manually selected patches [4] and the third one is a combination of the first two approaches:

Non-adapted CNN (NA-CNN) [13]: In the case of the first investigated representation, a convolutional neural network pretrained on the ImageNet challenge data, specifically the VGG-f network [13], is utilized. We chose this network because it provided the best outcomes for the classification of CD in [4]. The images are fed through the CNN and the activations of the last convolutional layer (4096 feature elements) are extracted as feature vectors.

Adapted CNN (A-CNN) [4]: The second representation (A-CNN) is based on the same network (VGG-f); however, the (already pretrained) network is adapted to the classification of CD by training it on the IP data set. In previous work [4], this approach achieved excellent performances in classifying CD and outperformed the NA-CNN approach. As for the so-called ‘fully fine-tuned’ VGG-f network in [4], the A-CNN network is trained for 5000 iterations utilizing stochastic gradient descent and the training images are randomly augmented (cropping, rotation and horizontal flipping). In fact, the only difference to the ‘fully fine-tuned’ VGG-f network in [4] is that the batches of images extracted for training consist of only 64 instead of 128 images, due to the limited amount of available training samples. As for the NA-CNN approach, the activations of the last convolutional layer are extracted as feature vectors.
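A hedged sketch of such an adaptation is given below; VGG16 again stands in for VGG-f, and the learning rate, momentum, rotation range, crop size and the `ip_batches` iterator are assumptions (the original training uses MatConvNet, cf. Sect. 3.2).

```python
import torch
from torchvision import models
import torchvision.transforms as T

# A-CNN sketch: fine-tune a pretrained network with a two-class head for
# 5000 iterations of SGD, batch size 64, with random augmentation.
# ip_batches is an assumed (endless) iterator over batches of 64
# 128 x 128 IP patches and their labels.
net = models.vgg16(weights="IMAGENET1K_V1")
net.classifier[-1] = torch.nn.Linear(4096, 2)   # CD vs. non-CD head

augment = T.Compose([
    T.RandomRotation(180),       # rotation range is an assumption
    T.RandomCrop(112),           # crop from the 128 x 128 patch (assumption)
    T.Resize(224),               # rescale to the network's input size
    T.RandomHorizontalFlip(),
])
optimizer = torch.optim.SGD(net.parameters(), lr=1e-4, momentum=0.9)
loss_fn = torch.nn.CrossEntropyLoss()

for step, (patches, labels) in zip(range(5000), ip_batches):
    loss = loss_fn(net(augment(patches)), labels)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```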

Combined CNN (C-CNN): The third descriptor, C-CNN, is a combination of NA-CNN and A-CNN. For this purpose, we concatenate the feature vectors of both approaches, leading to an 8192-dimensional representation.

For further details as well as evaluations of the utilized image representations, we refer to the original publication, where they were exhaustively assessed with respect to the classification of informative patches [4].

3 Experiments

3.1 Image Data Sets

The utilized material consists of images captured by physicians during routine gastrointestinal endoscopies at the St. Anna Children’s Hospital. Images were captured with Olympus endoscopes (GIF-Q165, GIF-N180) providing lossless-compressed (PNG) images with resolutions between \(528\,\times \,522\) and \(768\,\times \,576\) pixels.

For training and validating the proposed pipeline, an IP data set containing informative patches only (\(128\,\times \,128\) pixels) and an AP data set containing automatically extracted patches (72 patches of \(128\,\times \,128\) pixels per image at fixed coordinates) are required. Due to the relatively small size of the original endoscopic images, neighboring patches overlap. This, however, does not introduce any bias into the evaluation. Specifically, we extracted the patches in a fixed rectangular grid (nine coordinates in horizontal and eight in vertical direction) with an offset of 40 pixels. The grid is adjusted to capture real image information only and to exclude the outer parts providing meta information only (Fig. 2).
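A minimal sketch of this fixed-grid extraction; the grid origin used to skip the border region carrying only meta information is an assumption.

```python
import numpy as np

def extract_grid_patches(image, x0=40, y0=30, size=128, offset=40):
    """Extract 9 x 8 = 72 overlapping 128 x 128 patches on a fixed grid
    with a 40-pixel offset; (x0, y0) is an assumed grid origin."""
    patches = []
    for row in range(8):               # eight vertical positions
        for col in range(9):           # nine horizontal positions
            y, x = y0 + row * offset, x0 + col * offset
            patches.append(image[y:y + size, x:x + size])
    return np.stack(patches)           # 72 patches per image
```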

For the IP data set, an experienced physician manually selected between zero and four informative regions per endoscopic image exhibiting markers for an effective distinction between the two classes [7]. The IP data set is only utilized for training, whereas the AP data set needs to be separated into a training and an evaluation set. To facilitate an unbiased evaluation, we take care that the data of one patient (in the IP and the AP data set) is used either for training or for validation. We automatically generate four IP and corresponding AP data sets without overlaps in patient data (Table 1). To facilitate fusion on patient level (i.e. we obtain one decision per patient), the AP data sets contain four images for each patient. The discriminative model is trained on the IP data set after CNN feature extraction. For certainty estimation, the AP data set is randomly split into two equally sized, balanced sets (without overlaps of patients’ data). One is utilized for certainty estimation and for providing the kNN’s training samples, the other one for evaluation. This step is repeated with swapped training and evaluation sets. This policy is applied for all corresponding data sets (IP and AP) and the obtained mean classification rates (accuracy, sensitivity, specificity) as well as the standard deviations are finally reported.
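The patient-disjoint splitting might be realized as sketched below, assuming each patch descriptor carries the ID of the patient it originates from (variable names are our own).

```python
from sklearn.model_selection import GroupShuffleSplit

# Draw a 50/50 split of the AP data such that no patient contributes to
# both halves; ap_descriptors, ap_labels and ap_patient_ids are assumed
# arrays with one row/entry per patch.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
fit_idx, eval_idx = next(splitter.split(ap_descriptors, ap_labels,
                                        groups=ap_patient_ids))
# first pass: fit the certainty model and the kNN on fit_idx, evaluate
# on eval_idx; second pass: swap the two halves and average the rates
```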

We consider a binary classification between images showing healthy mucosa (‘Marsh-0’, in the following \(C_0\)) and CD (‘Marsh-3’, in the following \(C_1\)) [1, 7]. The reason for working with this problem definition is that the available image data set contains rather few images for certain subclasses (e.g. ‘Marsh-3C’) of the four-class case (‘Marsh-0’ vs. ‘Marsh-3A’ vs. ‘Marsh-3B’ vs. ‘Marsh-3C’). Furthermore, the two-class case is the most relevant for clinical practice [7].

Table 1. Details on the data sets utilized for training and evaluation.

3.2 Evaluation and Implementation Details

As discriminative model to classify the extracted CNN features, a C-SVM (LIBLINEAR [14]) is applied, which exhibited excellent performance in previous work [4]. The SVM cost factor (C) is selected utilizing inner cross validation (between \(2^0\) and \(2^{10}\)) on the IP training data. The k value for the kNN classifier is also selected during inner cross validation (\(1\,\le \,k\,\le \,51\), step size 5). For the GMM, 64 components are specified to precisely estimate the distributions. Due to the low dimensionality, the large number of samples and the rather smooth distribution, this number is not critical and is therefore not optimized. The GMM is initialized randomly and optimized by means of expectation maximization. For the histograms building the final image representation, a linear binning with eight bins is utilized to obtain a trade-off between precision and robustness. For training the A-CNN, MatConvNet [15] is utilized.
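For illustration, the inner cross validation for the SVM cost factor might look as follows; the fold count is an assumption, and the k of the kNN classifier is selected analogously.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Select C in {2^0, ..., 2^10} by inner cross validation on the IP
# training data; ip_features / ip_labels are assumed placeholders, and
# the 5-fold setting is an assumption.
search = GridSearchCV(LinearSVC(),
                      {"C": [2.0 ** e for e in range(11)]}, cv=5)
search.fit(ip_features, ip_labels)
svm = search.best_estimator_
# the kNN's k is searched analogously over 1 <= k <= 51 (step size 5)
```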

Fig. 4.

Experimental results for image-wise (a) as well as patient-wise (b) classification.

3.3 Results

Figure 4 shows the classification results achieved with the proposed approach compared to the state-of-the-art quality-metric-based approach [8]. For image-wise classification, overall classification rates (accuracies) between 0.741 and 0.772 are obtained. C-CNN performs best on average; here, the proposed method exhibits a performance similar to the state-of-the-art. Regarding patient-based diagnosis, our proposed approach in combination with the C-CNN image representation exhibits the best accuracy (0.848).

Classifying on patch level without any fusion (not shown in the table), we obtain accuracies between 0.60 and 0.65, which is clearly lower than the rates obtained for informative patches reported in [4] (accuracies above 0.9). Even with majority voting (similar to [10] in the two-class case), accuracies are always below 0.7 for image-wise and patient-wise classification.

Figure 5 shows the rates obtained with selective classification, i.e. we perform image-wise classification with the restriction that only certain images are selected according to the confidence measure of the kNN classifier. This experiment is performed to predict the achievable performance if more (and more diverse) images were available per patient. In these outcomes, a confidence > 0.7 indicates that at least 70% of the nearest neighbors must belong to one particular class for an image to be selected. By selecting only images with a high confidence, the accuracy increases to up to 0.903 (NA-CNN) without introducing a severe imbalance between sensitivity and specificity. Only the A-CNN feature suffers from a decreased sensitivity at high confidences and hence does not show comparable outcomes.
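A minimal sketch of this selective evaluation, assuming per-image predictions, kNN confidences and ground-truth labels are given as numpy arrays (names are our own):

```python
import numpy as np

def selective_accuracy(predictions, confidences, y_true, threshold=0.7):
    """Evaluate only images whose kNN neighbor agreement reaches the
    threshold; returns the accuracy and the fraction of kept images."""
    keep = confidences >= threshold
    if not keep.any():
        return np.nan, 0.0
    acc = (predictions[keep] == y_true[keep]).mean()
    return acc, keep.mean()
```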

Fig. 5.

Experimental results for image-wise classification with varying confidence levels. The confidence level here corresponds to the distribution of class labels among the kNN’s selected neighbors. The left column (confidence 0.5) corresponds to the typical setting (Fig. 4 (a)), where each image is classified.

Table 2. Classification results achieved with the proposed and the metric-based reference approach [8] for image- and patient-wise classification. In addition to the sensitivity (sens), the specificity (spec) and the accuracy, we provide the accuracies’ standard deviations for classification without a minimum confidence (upper part). In the case of introduced confidences (image-wise classification), we provide the fraction of the selected images in brackets.

The precise figures as well as sensitivities and specificities of all experiments are provided in Table 2. For the experiment with selective classification, the fraction of selected samples (images) is provided in brackets.

3.4 Discussion and Conclusion

We proposed a method to obtain decisions from computer aided diagnosis systems, based on original endoscopic image data, without any user interaction. A previous method [8], representing the state-of-the-art and relying on a linear combination of basic quality measures, was outperformed in patient-based classification for each image representation. Furthermore, the variability in performance of our proposed approach is lower. Another advantage of the proposed method compared to the quality-metric-based approach is that no handcrafted metrics are required for assessing patch quality. Consequently, no conceptual changes need to be performed when the imaging conditions change. Furthermore, the training data, specifically the AP data set, can be augmented by adding endoscopic images obtained during endoscopies (i.e. in contrast to [8], no manual selection of informative patches is required for enlarging the training data set).

Interestingly, the adapted A-CNN, which exhibits the best accuracies in patch-wise classification of idealized patch data [4], leads to inferior outcomes for image- and patient-wise classification utilizing non-idealized patch data. We suppose that, by training a CNN to discriminate between informative patches with clearly visible mucosal structures, the CNN features lose distinctive power for rating non-informative patches. The structure of the mucosal villi is the most important criterion for the visual differentiation between healthy mucosa and CD. Depending on the severity of the disease, the mucosal inflammation in CD causes either mild villous atrophy, marked villous atrophy or a complete absence of villous structure. In non-informative patches, however, villous structures are often not visible due to image degradations (e.g. out of focus, low contrast, etc.), and so the CNN probably misinterprets those non-informative patches without visible villi as being affected by CD. As a consequence, the network output cannot be utilized effectively to estimate probabilities for correct classifications. Therefore, training CNNs utilizing informative CD image data actually turned out to be a disadvantage for the classification of non-informative images. However, this disadvantage can be compensated by combining adapted CNNs with non-adapted CNNs (C-CNN), which provided the best accuracies for image- and patient-wise diagnosis. In the case of selective classification, we notice that NA-CNN outperforms C-CNN, which is probably caused by the increasing imbalance between sensitivity and specificity of the A-CNN at higher confidence levels.

In this study, we put emphasis on including highly realistic endoscopic images, and not only rather idealized data, in the image data sets for evaluation. For this reason, the proposed approach could be implemented in a clinical system without any significant changes. Considering the obtained classification accuracies, a point of criticism could be that they are still far away from 100%. However, we identified several scenarios which can lead to distinctly higher accuracies in clinical routine: novel endoscopic devices (exhibiting higher resolutions and new modalities) could be applied to improve the classification performance further. Furthermore, the evaluation data exhibits a high degree of correlation between a patient’s images. We had to include such correlated image data, although the images show similar regions of the mucosa, in order to keep the amount of data relatively high. Based on the outcomes with increased confidence levels (Table 2), it can be assumed that the rates could be improved significantly by utilizing uncorrelated image material. Finally, video material (endoscopic video frames) could be utilized in addition to the images taken by the physicians to increase data diversity.

To conclude, we proposed an effective approach for fully-automated CNN-based classification of endoscopic images for CD diagnosis. Notably, the best performances are not obtained with the adapted CNN, but with a network trained for the ImageNet challenge as well as with the combination of the two evaluated networks. Fusing data on patient level, an accuracy of 0.85 can be obtained, clearly outperforming the state-of-the-art. Experiments provide evidence that the rates could be increased further if more data per patient were available, providing an incentive for incorporating video material in future work.