
VisActive: Visual-concept-based Active Learning for Image Classification under Class Imbalance

Published: 10 November 2023

Abstract

Active learning methods recommend the most informative images from a large unlabeled dataset for manual labeling. These methods improve the performance of an image classifier while minimizing manual labeling efforts. We propose VisActive, a visual-concept-based active learning method for image classification under class imbalance. VisActive learns a visual concept, a generalized representation that holds the most important image characteristics for class prediction, and then recommends for each class four sets of unlabeled images with different visual concepts to increase the diversity and enlarge the training dataset. Experimental results on four datasets show that VisActive outperforms the state-of-the-art deep active learning methods.

1 Introduction

Deep image classification models have achieved outstanding performance in various applications [1, 2]. However, their success is strongly related to the availability of large, annotated datasets to learn image features critical for accurate predictions [3]. In some domains, such as medicine, manual labeling of large datasets by domain experts is very expensive and often infeasible [4, 5]. Deep Active Learning (DAL) methods were proposed to recommend the least number of images from a large unlabeled dataset for manual labeling while maximizing the image classifier's performance [6–9].
Given an initial labeled dataset, typically a small one, a DAL method trains a deep model on this dataset and uses this newly trained model to recommend unlabeled images for human labeling. Finally, the newly labeled images are moved from the unlabeled dataset to the new training dataset [10–13]. This process is repeated until a pre-defined termination condition is reached [3]. Examples of these conditions are that a budget for manual labeling is exhausted or a certain performance goal is reached.
Class imbalance further complicates the curation of a good training dataset. Class imbalance occurs when some classes (common classes) have many more images [14] compared to the other classes (rare classes) [15]. Class imbalance is present in many application domains [16–19]. Curating a good training dataset under class imbalance is more time-consuming due to the difficulty in finding enough rare class images and selecting images representing a variety of appearances within the common class [14]. For example, in screening colonoscopy, polyps were found in 23.9% to 35.7% of patients aged 65 years and older [20]. Furthermore, polyps are typically visible for a few seconds [21] during the withdrawal phase (recommended for at least 6 minutes [22]). DAL methods that do not consider class imbalance are likely to recommend more images of the common class. This is because there are more of these images with varying appearances [23].
In this article, we propose VisActive (shown in Figure 1), a visual-concept-based active learning method for image classification under class imbalance. VisActive automatically finds visual concepts for each class. The visual concept [21] is defined as a representation most similar to a set of image feature patterns used by an image classifier for class prediction. Each class is associated with several visual concepts learned from the training images. Visual concepts could represent objects, parts of objects, textures, or colors relevant to class prediction. For a polyp image classification problem, a visual concept could represent an entire polyp or its shape, color, or texture related to the polyp class. VisActive uses the visual concepts and our proposed uniqueness and coverage scores to recommend four sets of unlabeled images for each class. The goal is to increase the size and diversity of the training dataset while minimizing labeling costs.
Fig. 1.
Fig. 1. The VisActive deep active learning framework for image classification under class imbalance. The Concept Learner is introduced to capture visual concepts in the current training dataset. Uniqueness and coverage scores are introduced to diversify the current training dataset. The Rare-Concepts Set handles class imbalance, the Cross-Concepts Set reduces confusion between classes, and the Unseen-Concepts Set obtains concepts not already in the training dataset. The four sets of unlabeled images are recommended in each iteration to increase the size and diversity of the initial training dataset.
The four sets are as follows: (1) The common-concepts set has unlabeled images with visual concepts similar to those present in many images of a given class in the labeled dataset. We include this set to ensure that the final training dataset has adequate labeled images with common visual concepts. (2) The rare-concepts set contains unlabeled images with visual concepts similar to those present in a small portion of the labeled images of a given class. The goal is to maintain a good balance between the rare and common visual concepts in each class. (3) The cross-concepts set has unlabeled images with visual concepts in both the rare and common classes in the labeled dataset. Labeling the cross-concepts set images reduces the confusion of the classifier when determining the correct class. (4) The unseen-concepts set contains unlabeled images with visual concepts not present in the labeled dataset. Acquiring this set should increase the diversity of the labeled dataset with new concepts to improve the classifier's performance in practice.
Our contributions are summarized as follows:
We propose VisActive with three new ideas: (1) The use of visual concepts for active learning in contrast to using global features is proposed. Having different concepts provides more options for recommendations to enhance the size and diversity of the training dataset. The training algorithm used for obtaining the visual concepts for each class is new to this work. The training algorithm is designed to handle class imbalance. (2) New indicators, called uniqueness and coverage scores, associated with each visual concept are proposed. A uniqueness score of a visual concept of a class is highest when the visual concept is not in the training images of any other class except its class. On the other hand, a coverage score of a visual concept is highest if the visual concept is in all the training images of its class. We use these two indicators to find the most informative unlabeled images to recommend. (3) VisActive recommends the four aforementioned sets of unlabeled images. The rare-concepts set is first introduced in this work. The source code of VisActive is publicly available at github.com/cml-cs-iastate/VisActive.
We evaluate VisActive on three public datasets (Kvasir, Caltech, and CIFAR10) and a dataset with a limited disclosure (Forceps). These datasets have different numbers of classes, class imbalance ratios, and subject domains. Based on the macro F1-score, VisActive outperforms five methods for binary classification and three methods for multi-class classification.
We conducted four ablation studies. The first two studies give insight into the improvement each recommended set contributes to classification performance. The third study investigated whether the learned visual concepts are indeed useful for image classification. The last study investigated the characteristics of the uniqueness and coverage scores of our visual concepts.
The remainder of this article is organized as follows. Section 2 presents related work on DAL and the background work. Section 3 describes the proposed method. Section 4 presents the experimental setup and results. The conclusion and the description of future work are presented last.

2 Related Work and Background

We first describe related deep active learning methods. Non-deep learning methods are summarized in [15]. Next, we provide background on the method for learning visual concepts [21].

2.1 Deep Active Learning

DAL methods are divided into query-synthesizing or query-acquiring approaches, as summarized in the survey [19]. The query-synthesizing approach uses generative models to synthesize informative samples. The query-acquiring approach recommends the most informative images from the unlabeled image set. This approach has been intensively investigated; therefore, we focus on the query-acquiring approach in this article. Table 1 presents a summary of the query-acquiring DAL methods. The best-performing method in each category is marked with an asterisk.
Table 1. Summary of Related Works on DAL

Category | Method | Designed for Class Imbalance
Uncertainty Approach | Cost-Effective Active Learning (CEAL) [26] | No
Uncertainty Approach | Active Incremental Fine-Tuning (AIFT) [28] | No
Uncertainty Approach | Pretext Tasks for Active Learning (PT4AL) [29] * | Yes
Uncertainty Approach | Loss Prediction Module [30] | No
Uncertainty Approach | Deep Bayesian Active Learning (DBAL) [31] | No
Feature-based Approach | Similarity-based Active Learning (SAL) [14] * | Yes
Feature-based Approach | Core-set [27] | No
Feature-based Approach | ProbCover [35] | No
Variational Autoencoder | Variational Adversarial Active Learning (VAAL) [36] | No
Variational Autoencoder | Variational Bayes for Active Learning (VaB-AL) [23] | Yes
Variational Autoencoder | Task-Aware Variational Adversarial AL (TA-VAAL) [37] * | Yes
Uncertainty approach: This approach measures uncertainty based on classification probabilities and recommends the most uncertain images from the unlabeled dataset [24–28]. These images are expected to capture the most informative features that are not already present in the labeled dataset. Adding the newly labeled images verified by the domain expert improves the classification performance. Cost-Effective Active Learning (CEAL) [26] follows the uncertainty approach and recommends unlabeled images based on a classification probability threshold. Furthermore, CEAL pseudo-labels unlabeled samples whose class prediction probability exceeds another threshold that is automatically adjusted in each iteration. These images are included in the training dataset for the predicted class for the next iteration without domain expert verification. Pseudo-labeling was shown to improve CNN classification performance.
Active Incremental Fine-Tuning (AIFT) [28] utilizes data augmentation for recommending unlabeled images. AIFT is based on an observation that all images synthesized from the same image share the same label and are naturally expected to have similar predictions by the image classifier. AIFT recommends unlabeled images based on the authors’ definitions of entropy and diversity measures on some synthesized images of each unlabeled image. AIFT does not perform well under class imbalance [14]. When the images of the common class are much more diverse than those of the rare class, the entropy and the diversity measures are biased toward recommending images of the common class.
Pretext Tasks for Active Learning (PT4AL) [29] utilizes self-supervised pretext tasks and an uncertainty data sampler to recommend images for human labeling. PT4AL does not require the initial manually curated labeled dataset and is applicable for several main tasks, e.g., image classification and semantic segmentation. The method trains a pretext task model on the unlabeled images with a rotation prediction loss that is highly correlated to the main task loss, i.e., image classification loss. Then, PT4AL sorts the unlabeled images in descending order of the loss values and recommends the images with the highest losses for the initial seed dataset. For recommendation in a subsequent iteration, PT4AL uses an uncertainty measure based on the posterior class probability of the remaining unlabeled images predicted by the image classifier. PT4AL uses a batch splitter to ensure balanced sampling across the entire data distribution. PT4AL can handle a mildly imbalanced dataset well [29]. However, our experimental results show that PT4AL does not perform as well for moderately to severely imbalanced classes.
The authors of [30] proposed a task-agnostic active learning method. They attached a small parametric module called the “loss prediction module” to the image classifier to learn to predict the losses for the unlabeled images. Then, this module predicts the unlabeled images that are likely to fail correct classification by the image classifier. The loss prediction module is simple and applicable to any neural network regardless of the task, the number of tasks, or the complexity of the main task architecture. However, like CEAL and AIFT, the loss prediction module is not designed for class imbalance.
Deep Bayesian Active Learning (DBAL) [31] utilizes Bayesian convolutional neural networks [32] to build an active learning framework using graph-based approaches to handle high-dimensional image data. DBAL first performs Gaussian prior modeling on the weights of a CNN and then uses variational inference to obtain the posterior distribution of the network prediction. DBAL also uses the Monte-Carlo dropout stochastic regularization technique [33] to obtain posterior samples. DBAL achieved good accuracy on real-world datasets. However, it is unsuitable for large datasets due to the need for batch sampling [27].
Feature-based approach: This approach extracts important features from labeled images and utilizes these to recommend informative images from the unlabeled dataset. The authors of [14] proposed Similarity-based Active Learning (SAL) to handle real-world scenarios with both a significant class imbalance and a limited labeled dataset. During training, SAL creates many combinations of triplets, each with two images of the same class and one image of a different class. SAL optimizes the similarity-based triplet loss to generate a feature extractor that best discriminates between classes. SAL calculates one class-feature representation by averaging the features of all the labeled images of that class. As a result, the rare features of the class are excluded from the class-feature representation. SAL recommends unlabeled images for each class based on the distance between the class-feature representation and the features of the unlabeled images. Like CEAL, SAL also pseudo-labels unlabeled images with a high classification probability given a threshold and automatically adjusts the threshold value for pseudo-labeling for each iteration.
The authors of the Core-Set active learning method [27] proposed a new core-set loss, which is the difference between the average empirical loss over labeled data points and that over the entire dataset including the unlabeled data. For each iteration, solving the k-Center problem [34] on the unlabeled data points minimizes the upper bound of the loss. The unlabeled images at the centers are recommended for each iteration. This method is computationally expensive, especially with a large unlabeled dataset since it requires building a large distance matrix to solve the k-Center problem [3]. Related to Core-Set, the authors of ProbCover [35] proposed an active learning method for a low-budget regime that maximizes probability coverage.
Variational autoencoder approach: This approach utilizes Variational Autoencoder (VAE) to learn the latent space of labeled images. Then, the learned representations are used to recommend the most informative images to be labeled by a human expert. The authors of Variational Adversarial Active Learning (VAAL) [36] adapted VAE to generate image representations similar to those of labeled and unlabeled images. VAAL uses an adversarial network that learns to discriminate the feature representations of the unlabeled and labeled images. The goal is to recommend unlabeled images with features different from those found in the labeled dataset. VAAL is task-agnostic. Task-Aware Variational Adversarial Active Learning (TA-VAAL) [37] alters VAAL to be task aware by considering the classification probability of each unlabeled data point before labeling. TA-VAAL uses the loss ranking prediction [30] and embeds the normalized ranking loss information from the image classifier to identify images with low classification probability. Variational Bayes for Active Learning (VaB-AL) [23] uses VAE to estimate the lower bound of the likelihood of misclassifying an unlabeled image. VaB-AL was shown to perform better than the Core Set algorithm [27], VAAL, and DBAL [31]. Like TA-VAAL, VaB-AL is task aware and performs well under class imbalance.
Compared to recent DAL methods that work well for class imbalance based on VAE, such as TA-VAAL and VaB-AL, the proposed VisActive works well with even more limited labeled data. VisActive does not rely on the training of the auto-encoder and auto-decoder, which usually requires comparably more training data. Furthermore, VisActive gives fine-grained control over its recommendation, e.g., finding rare concepts, cross-concepts, and unseen concept sets of images. Therefore, the final image classifier is effective across different image characteristics of rare and common classes. Compared to SAL, VisActive gives more fine-grained features at the object or part of object levels. Hence, VisActive can be more intentional about what to recommend.

2.2 Visual Concept Layer

Khaleel et al. previously proposed the visual concept (VC) layer to learn visual concepts automatically and use them to explain a classifier's decision on a given image [21]. In VisActive, we utilize this component to learn different key feature patterns in the training data. We summarize the VC layer here. Let \(fm\) represent \(d\) feature maps of size \(W \times H\) values; \(fm\) is the output of the last convolutional layer of the classifier of a given image. Each neuron in the VC layer learns one visual concept for a given class using a filter \(vc \in {\mathbb{R}}^{( {h + 1} ) \times ( {w + 1} ) \times d}\) , where \(1 \le h < H\) and \(1 \le w < W\) are the predefined height and width of the visual concept, respectively. Choosing appropriate values for \(h\) and \(w\) depends on the input image size and the depth of the convolutional network. For example, for a 64 \(\times\) 64 image, we recommend setting the value of \(h\) and \(w\) to four with shallow networks and to one or two for a deeper network. The visual concept size corresponds to a receptive field at the input layer. We recommend choosing a visual concept size so that its receptive field is large enough to contain most of an object for classification. The filter \(vc\) is randomly initialized. The training process keeps adjusting the filter values to reduce the training loss. The resulting values of the filter \(vc\) represent one visual concept for its class. The loss function used in [21] relies on the distance function in Equation (1) between the filter \(vc\) and the corresponding feature maps partition \(fv \in {\mathbb{R}}^{( {h + 1} ){\rm{\ }} \times {\rm{\ }}( {w + 1} ){\rm{\ }} \times d}\) when sliding the filter \(vc\) across \(fm\) with a stride of one:
\begin{equation} dist\left( {vc,fv} \right) = minpool\left( {avgpool\left( {\|vc - \lambda \left( {fv} \right)\|_2^2{\rm{\ }}} \right)} \right). \end{equation}
(1)
As shown in Figure 2, the function \(\|vc - \lambda ( {fv} )\|_2^2\) calculates the squared \({L}_2\) distance for each of the \(d\) dimension vectors of \(vc\) and \(\lambda ( {fv} )\) . The \(\lambda ( {fv} )\) multiplies each border cell in \(fv\) with an importance constant between 0 and 1. The use of the \(\lambda ( {fv} )\) function keeps the most important features of the target concept in the central part of the filter. Next, \(avgpool()\) , with the window size \(h\ \times w\) and a stride of one, outputs the average of the values under the window, which gives a matrix of \(h\ \times w\) values. Then, \(minpool()\) with window size \(h\ \times w\) returns the minimum distance score between \(vc\) and \(fv\) . Sliding \(vc\) across \(fm\) results in a distance matrix \(D{M}^{vc}\) that stores the minimum average distance between \(vc\) and each partition \(fv\) of \(fm\) . Last, we select the minimum distance in \(D{M}^{vc}\) using the \(min( \ )\) function and convert the result to a similarity score, as shown in Equation (2). The result of Equation (2) is the highest similarity score of the visual concept \(vc\) in the feature map \(fm\) . The value range of \(sim( {vc,\ fm} )\) is between 0 and 1. We utilize the VC layer in our proposed method.
\begin{equation} sim\left( {vc,\ fm} \right) = \frac{1}{{1 + min\left( {D{M}^{vc}} \right)}} \end{equation}
(2)
Fig. 2.
Fig. 2. The workflow of the VC layer for one visual concept. Both \({\boldsymbol{fv}}\) and \({\boldsymbol{vc}}\) have dimensions \(({\boldsymbol{h}}+1) \times ({\boldsymbol{w}}+1) \times {\boldsymbol{d}}\). See Section 2.2 for more details.
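For readers who prefer code to notation, the following NumPy sketch computes Equations (1) and (2) for a single visual concept under our reading of the description above; the border-weighting value and all array shapes are assumptions made for illustration, not the released implementation.

```python
import numpy as np

def vc_similarity(vc, fm, lam=0.7):
    """Sketch of Equations (1) and (2): highest similarity of one visual
    concept `vc` of shape (h+1, w+1, d) within feature maps `fm` of shape
    (H, W, d).  `lam` down-weights the border cells of each window, as the
    lambda() function does in the text."""
    hp, wp, d = vc.shape                 # hp = h + 1, wp = w + 1
    H, W, _ = fm.shape
    h, w = hp - 1, wp - 1

    # Border-weighting mask for lambda(fv): 1 in the centre, lam on the border.
    mask = np.full((hp, wp, 1), lam)
    mask[1:hp - 1, 1:wp - 1, :] = 1.0

    distances = []
    for i in range(H - hp + 1):          # slide vc over fm with stride 1
        for j in range(W - wp + 1):
            fv = fm[i:i + hp, j:j + wp, :]
            sq = np.sum((vc - mask * fv) ** 2, axis=-1)   # squared L2 per cell
            # avgpool with an (h, w) window and stride 1, then take the minimum
            pooled = [sq[a:a + h, b:b + w].mean()
                      for a in range(hp - h + 1) for b in range(wp - w + 1)]
            distances.append(min(pooled))                 # minpool
    dm_min = min(distances)                               # min over the distance matrix
    return 1.0 / (1.0 + dm_min)                           # Equation (2)
```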

3 Proposed VisActive: Visual-concept-based Active Learning Framework

For ease of exposition, we describe our framework in the context of a binary image classification problem. We assume that the classification problem has an inherent moderate to severe class imbalance of at least a 1:10 ratio between the rare and common classes. Let \({X}^l\) be a set of labeled images where each image has one associated label in \(C\) —the set of class labels. \(| C |\) denotes the number of classes. Let \({X}^u\) be a large unlabeled imbalanced dataset.
Figure 1 shows the overall framework with two phases. Phase I starts by training the Concept Learner (CL) on \({X}^l\) . The initial \({X}^l\) is small and class balanced. The initial dataset is class balanced to avoid a bias when learning visual concepts for each class. Phase II utilizes the trained CL model on \({X}^u\) to generate four sets of candidate images to be manually labeled. After the domain experts manually label them and move them into their respective classes in \({X}^l\) , the two phases are repeated until the stopping condition is met.
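To make the iteration concrete, the following Python sketch mirrors the two-phase loop described above. The function names (train_concept_learner, recommend_four_sets, expert_label) are hypothetical placeholders standing in for Phase I, Phase II, and the human-in-the-loop step; they are not part of the released code.

```python
# Minimal sketch of the VisActive iteration loop (Section 3 overview).
def visactive_loop(labeled, unlabeled, budget, train_concept_learner,
                   recommend_four_sets, expert_label):
    """Repeat Phase I and Phase II until the labeling budget is exhausted."""
    while budget > 0:
        cl_model = train_concept_learner(labeled)              # Phase I
        candidates = recommend_four_sets(cl_model, unlabeled)  # Phase II: XCS, CCS, UCS, RCS
        newly_labeled = expert_label(candidates)               # manual labeling
        for image, label in newly_labeled:
            unlabeled.remove(image)
            labeled.append((image, label))
        budget -= len(newly_labeled)
    return labeled
```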

3.1 Phase I: Train the Concept Learner

We train the CL model on \({X}^l.\) We describe the model architecture, the loss function, and the training algorithm. Our CL model utilizes the VC layer of [21] summarized in Section 2.2 but uses a new training algorithm, Algorithm 1, and a slightly different loss function from [21].

3.1.1 CL Model Architecture.

Our CL model architecture consists of two parts: the convolutional block to extract important image features and the visual concepts block to generalize these features into visual concept representations. For simplicity, we utilize four convolutional layers for the convolutional block, where each convolutional layer is followed by a pooling layer. Other convolutional configurations, such as ResNet or VGG, can be used. The selection of the convolutional block mainly depends on the complexity of the dataset. The output feature maps of the convolutional block are inputs to the visual concept block. This block has one added convolutional layer, a VC layer, and a classification layer, as shown in Table 2. This added convolutional layer performs convolutional operations with a filter of size 1 × 1 followed by a sigmoid activation function to output \(d\) feature maps with values between 0 and 1. The feature maps are input to the VC layer summarized in Section 2.2. We configure the VC layer to learn \(m\) distinct visual concepts for each class. As a result, for binary classification, there are \(2m\) outputs to the classification layer. The VC layer calculates a similarity score for each visual concept using Equation (2). These scores are passed to the classification layer to calculate the loss function \(\mathcal{L}\) as described in Section 3.1.2. The classification layer is a fully connected layer with a softmax activation function. With a limited labeled dataset, learning a good representation is difficult. Methods that learn image embeddings such as VAE typically require a reasonable-sized labeled dataset to start with. Therefore, we use visual concepts [21] instead.
Table 2. The Visual Concept Block Architecture of the Concept Learner

Layer | Filter Size | Input Size | Output Size
Added Convolutional Layer | \(1 \times 1 \times d\) | \(H \times W \times d\) | \(H \times W \times d\)
Visual Concept (VC) Layer | \((h + 1) \times (w + 1) \times d\) | \(H \times W \times d\) | \(m \times |C|\)
Classification Layer | – | \(m \times |C|\) | \(|C|\)
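A minimal Keras sketch of this architecture is given below. The custom VC layer of [21] is replaced by a generic stand-in (global pooling followed by a dense layer) purely so the skeleton runs end to end, and the layer sizes are illustrative assumptions rather than the authors' configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_concept_learner(input_shape=(64, 64, 3), d=128, m=10, num_classes=2):
    """Skeleton of the CL architecture in Table 2.  The custom VC layer from
    [21], which outputs m * num_classes similarity scores via Equation (2),
    is approximated here by a stand-in block."""
    inputs = tf.keras.Input(shape=input_shape)

    # Convolutional block: four conv layers, each followed by pooling.
    x = inputs
    for filters in (32, 64, 128, d):
        x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
        x = layers.MaxPooling2D(2)(x)

    # Added convolutional layer: 1x1 filters with a sigmoid, giving d feature
    # maps with values between 0 and 1 (first row of Table 2).
    x = layers.Conv2D(d, 1, activation='sigmoid')(x)

    # Stand-in for the VC layer: in VisActive this produces m * |C| similarity
    # scores, one per learned visual concept (second row of Table 2).
    vc_scores = layers.Dense(m * num_classes, activation='sigmoid')(
        layers.GlobalAveragePooling2D()(x))

    # Classification layer: fully connected with softmax (third row of Table 2).
    outputs = layers.Dense(num_classes, activation='softmax')(vc_scores)
    return tf.keras.Model(inputs, outputs)
```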

3.1.2 Loss Function for Training the CL Model.

Our loss function \(\mathcal{L}\) consists of cross-entropy ( \({\mathcal{L}}_{entropy}\) ) [38], diversity cost ( \({\mathcal{L}}_{diversity}\) ) [21], and clustering cost ( \({\mathcal{L}}_{clst}\) ) [39] as defined in Equations (3) to (5). \({{\rm{\Gamma }}}_e\) , \({{\rm{\Gamma }}}_d\) , and \({{\rm{\Gamma }}}_c\) are the coefficient multipliers with a value between 0 and 1. We empirically set \({{\rm{\Gamma }}}_e\) to 1, \({{\rm{\Gamma }}}_d\) to 0.2, and \({{\rm{\Gamma }}}_c\) to 0.8. \({\mathcal{L}}_{entropy}\) measures the classification error, which helps to extract convolutional features for each class.
\begin{equation} \mathcal{L} = {\Gamma }_e{\rm{\ }}{\mathcal{L}}_{entropy} + {\rm{\ }}{\Gamma }_d{\rm{\ }}{\mathcal{L}}_{diversity} + {\Gamma }_c{\rm{\ }}{\mathcal{L}}_{clst}{\rm{\ }} \end{equation}
(3)
The diversity cost defined in Equation (4) guides the learning of the visual concepts of each class \(c\) to increase the pair-wise distances of all the visual concepts of the class \((IntraDis{t}_c)\) . The goal is to learn more diverse concepts in each class.
\begin{equation} {\mathcal{L}}_{diversity} = - \frac{1}{{\left| C \right|}}{\rm{\ }}\mathop \sum \limits_{c \in C} IntraDis{t}_c{\rm{\ }} \end{equation}
(4)
\begin{equation*} IntraDist_c = \sum_{vc_1 \in VC_c} \min_{\substack{vc_2 \in VC_c \\ vc_2 \ne vc_1}} \|vc_1 - vc_2\|_2^2, \end{equation*}
where \(| C |\) is the total number of classes in \(C\) , \(vc \in V{C}_c\) is a visual concept of the class \(c\) , and \(\|v{c}_1 - v{c}_2\|_2^2\) is the squared \({L}_2\) distance function between two vectors of flattened matrices.
The clustering cost in Equation (5) guides the learning so that each training image \(x\) with \(c\) as its class label has one of the visual concepts of class \(c\) be most similar to a patch in the image.
\begin{equation} \mathcal{L}_{clst} = \frac{1}{|X^l|}\sum_{x \in X^l} \min_{vc \in VC_c} \left( 1 - sim\left( vc, fm_x \right) \right), \quad \text{where } c = y_x, \end{equation}
(5)
where \(|X^l|\) is the number of images in \(X^l\), \(min(\,)\) is the minimum function, and \(sim(vc, fm_x)\) returns the similarity score computed using Equation (2) between the visual concept \(vc\) and \(fm_x\), the feature map of image \(x\).
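The sketch below assembles the three terms of Equation (3) with NumPy. The array layouts for the concept filters and the per-image similarity scores are assumptions made for illustration, not the authors' data structures.

```python
import numpy as np

def visactive_loss(probs, labels, concepts, sims,
                   gamma_e=1.0, gamma_d=0.2, gamma_c=0.8):
    """Sketch of Equation (3).  `probs` is (N, |C|) softmax output, `labels`
    is (N,) integer class labels, `concepts[c]` is an (m, k) array of the
    flattened visual-concept filters of class c, and `sims[i, c]` is an
    (m,)-array of the Equation (2) similarities between image i's feature
    map and the m concepts of class c (shapes are assumptions)."""
    n, num_classes = probs.shape

    # Cross-entropy term.
    l_entropy = -np.mean(np.log(probs[np.arange(n), labels] + 1e-12))

    # Diversity term (Equation (4)): encourage concepts of a class to spread out.
    intra = []
    for c in range(num_classes):
        vcs = concepts[c]
        d2 = np.sum((vcs[:, None, :] - vcs[None, :, :]) ** 2, axis=-1)
        np.fill_diagonal(d2, np.inf)           # exclude vc1 == vc2
        intra.append(d2.min(axis=1).sum())
    l_diversity = -np.mean(intra)

    # Clustering term (Equation (5)): each image should be close to at least
    # one concept of its own class.
    l_clst = np.mean([np.min(1.0 - sims[i, labels[i]]) for i in range(n)])

    return gamma_e * l_entropy + gamma_d * l_diversity + gamma_c * l_clst
```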

3.1.3 CL Training Algorithm.

Algorithm (1) shows the CL training algorithm, taking the set of the labeled images \({X}^l\) and the number of epochs for each stage, \({e}_1\) , \({e}_2\) , and \({e}_3\) , as input. The number of epochs for each stage, \({e}_1\) , \({e}_2\) , and \({e}_3\) , is empirically set to give enough iterations for each stage to converge during the training. The different training stages are inspired by transfer learning and are important to obtain good visual concepts. Step 1 of Algorithm 1 constructs and trains an image classifier on \({X}^l\) . Step 2 constructs the CL model. The CL convolutional block has the same architecture as the convolutional layers of the image classifier. Step 3 transfers the learned weight values of the convolutional layers of the image classifier to the corresponding layers of the CL model. This helps the CL model to start the training with useful features that have been extracted by the image classifier.
In Steps 4 to 7, we train the CL model on \({X}^l\) using the loss function \(\mathcal{L}\) defined in Equation (3). Step 4 generates batches with an equal number of images from both classes. We use a balanced batch generator to train the model in Steps 5 to 7. Step 5 trains only the VC layer of the CL model for \({e}_1\) epochs while keeping the weight values of the other convolutional layers unchanged. In this step, the weights for the connections between the neurons in the VC layer and those in the classification layer of the CL model are set as done in [39]. A value of 1 is given to the connections between the neurons of the VC layer for the rare class and the neuron of the classification layer that holds the probability of the rare class. We give a value of \(-\) 0.5 to the connections between the neurons of the VC layer of the rare class and the neuron of the classification layer that holds the probability of the common class. The aim is to learn the visual concepts of the VC layer from the convolutional features captured in the training of the image classifier, as done in transfer learning [4042]. Step 6 trains the convolutional layers and the VC layer of the CL model for \({e}_2\) epochs. This allows the weights of the convolutional layers to be updated based on the loss function \(\mathcal{L}\) defined in Equation (3). Finally, Step 7 trains the entire CL model including the classification layer of the CL model for \({e}_3\) epochs. The goal is to adjust the weights of each visual concept to each class prediction. Section 4 includes a discussion regarding the appropriate number of epochs for these steps.
Initially \({X}^l\) is balanced. However, after some iterations, \({X}^l\) can become unbalanced when images of the rare class in \({X}^u\) are much fewer than those of the common class. We propose a balanced batch generator that randomly selects an equal number of images from each class in each batch. We use random sampling with replacement for the selection of the rare class images. We ensure that each image of the common class is only selected once per epoch. Thus, we guarantee that the optimization function updates the weights of the CL model without favoring the common class.
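A minimal sketch of such a balanced batch generator, assuming in-memory lists of images, might look as follows; names and labels are illustrative.

```python
import numpy as np

def balanced_batches(rare_images, common_images, batch_size, rng=None):
    """Sketch of the balanced batch generator in Section 3.1.3: every batch
    draws the same number of images from each class.  Rare-class images are
    sampled with replacement; each common-class image is used at most once
    per epoch (a trailing partial batch is dropped)."""
    rng = rng or np.random.default_rng()
    half = batch_size // 2
    common_order = rng.permutation(len(common_images))
    for start in range(0, len(common_order) - half + 1, half):
        common_idx = common_order[start:start + half]
        rare_idx = rng.choice(len(rare_images), size=half, replace=True)
        batch_x = [common_images[i] for i in common_idx] + \
                  [rare_images[i] for i in rare_idx]
        batch_y = [0] * half + [1] * half      # 0 = common, 1 = rare
        yield batch_x, batch_y
```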

3.2 Phase II: Recommendation

Phase II searches for the candidate unlabeled images to recommend. We propose the uniqueness and coverage measures to understand how distinct the learned visual concepts among the classes are and the prevalence of these concepts in the training dataset. Let \(V{C}_c\) be the set of visual concepts for the class \(c\) . Recall that \(C\) is a set of class labels and \(| C |\) denotes the number of classes. The uniqueness score of a visual concept of the class \(c\) is defined in Equation (6):
\begin{equation} Uniqueness\left( vc \in VC_c \right) = 1 - \frac{\sum_{\substack{\tilde{c} \in C \\ \tilde{c} \ne c}} \mathbb{I}_{I^{\tilde{c}}}}{|C|}, \end{equation}
(6)
where the \({\mathbb{I}}_{{I}^{\tilde{c}}{\rm{\ }}}\) indicator function evaluates to 1 if the visual concept \(vc\) is closest to any image patch of any training image \({I}^{\tilde{c}}\) of class \(\tilde{c}\) where \(\tilde{c} \ne c\) . Otherwise, this function evaluates to 0. We measure the closeness of a visual concept to an image patch of a given image using Equation (2). A visual concept \(vc\) of the class \(c\) has a uniqueness score of 1 when this \(vc\) is unique only to this class.
The coverage score of a visual concept for the class \(c\) is defined in Equation (7).
\begin{equation} Coverage\left( {vc \in V{C}_c} \right) = \frac{{\mathop \sum \nolimits_{i \in {I}^c} {\mathbb{Z}}_i}}{{\left| {{I}^c} \right|}},{\rm{\ }} \end{equation}
(7)
where \(| {{I}^c} |\) represents the number of images in \({I}^c,{\rm{\ }}\) the set of training images of the class \(c\) . The indicator function \({\mathbb{Z}}_i\) equals 1 if the \(vc\) is ranked in the top 5% of most similar visual concepts (among all visual concepts) to any patch of any image \({\rm{\ }}i \in {I}^c\) of this class. Similarly, a coverage score of 1 means that this \(vc\) is among the top 5% of the most similar visual concepts in all the images of this class. The uniqueness and coverage scores were inspired by the success of the confidence and the support scores of an association rule discovered by association rule mining algorithms [11].
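The following NumPy sketch computes both scores for every visual concept, assuming a precomputed matrix of best-patch similarities between images and concepts; the array layout and function name are assumptions made for illustration.

```python
import numpy as np

def uniqueness_and_coverage(sims, image_labels, concept_labels, top_frac=0.05):
    """Sketch of Equations (6) and (7).  `sims[i, k]` is the Equation (2)
    similarity between image i and visual concept k (its best-matching patch),
    `image_labels[i]` is image i's class, and `concept_labels[k]` is the class
    a concept belongs to."""
    image_labels = np.asarray(image_labels)
    concept_labels = np.asarray(concept_labels)
    n_images, n_concepts = sims.shape
    classes = np.unique(concept_labels)
    num_classes = len(classes)
    top_k = max(1, int(np.ceil(top_frac * n_concepts)))

    closest = sims.argmax(axis=1)                   # closest concept per image
    rank = np.argsort(-sims, axis=1)[:, :top_k]     # top-5% concepts per image

    uniqueness = np.ones(n_concepts)
    coverage = np.zeros(n_concepts)
    for k in range(n_concepts):
        c = concept_labels[k]
        # Equation (6): penalise concept k for every *other* class that has at
        # least one image whose closest concept is k.
        hits = sum(1 for c_tilde in classes if c_tilde != c and
                   np.any(closest[image_labels == c_tilde] == k))
        uniqueness[k] = 1.0 - hits / num_classes
        # Equation (7): fraction of the class's own images where concept k is
        # among the top-5% most similar concepts.
        own = np.where(image_labels == c)[0]
        coverage[k] = np.mean([k in rank[i] for i in own]) if len(own) else 0.0
    return uniqueness, coverage
```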
Algorithm (2) uses Equations (6) and (7) to recommend candidate unlabeled images. For ease of exposition, we assume two classes. Therefore, \(| C |\) is 2. Step 1 gives an unlabeled image \({x}^u\) to the trained CL model. As part of the inference process on \({x}^u\) , the similarity score for each visual concept of \({x}^u\) is calculated using Equation (2). As a result, we have \(2m\) similarity scores. The similarity score for each of the \(m\) visual concepts is the highest similarity between the visual concept and any image patch in the given image. Let \(hsc( {{x}^u} )\) output the highest similarity score of the \(2m\) similarity scores for \({x}^u\) . Step 3 sorts the unlabeled images in the descending order of their highest similarity score. Steps 4 through 14 select \(\delta\) best candidate unlabeled images for each of the four sets: cross-concepts set (XCS), common-concepts set (CCS), unseen-concepts set (UCS), and rare-concepts set (RCS).
Consider \({\rm{\ }}{t}_{clos}\) closest visual concepts for \({x}^u\) based on their similarity score. Step 5 uses \(crossConcept( {{x}^u,{\rm{\ }}{t}_{clos}} )\) to determine whether half of these visual concepts are of the rare class and the other half are of the common class(es). If so, the image has an equal number of concepts from both classes. Hence, it is suitable for the domain expert to label these images to reduce confusion by the classifier. Such an image is moved to the cross-concepts set in Step 6 if this set has fewer than \(\delta\) images. Next, if \(vc\) , the closest visual concept of \({x}^u\) , is also the closest visual concept in a large enough proportion of the labeled images, i.e., \(coverage( {vc} ) > {t}_{Cove}\) , and with a high enough similarity score, i.e., \(sim( {vc,\ f{m}_{{x}^u}} ) \ge {t}_{sim}\) , the \(vc\) is already a common concept in the training dataset; then \({x}^u\) is moved to the common-concepts set as shown in Steps 8 to 10. However, if \(sim( {vc,\ \ f{m}_{{x}^u}} ) < {t}_{sim}\) , the most similar image patch in \({x}^u{\rm{\ }}\) is still not very similar to the already learned visual concept \(vc\) ; then \({x}^u\) is moved to the unseen concepts set in Step 12 if UCS has fewer than \(\delta\) images.
If the closest visual concept to \({x}^u\) has a coverage score of less than \({t}_{Cove}\) and a uniqueness score of at least \({t}_{Uniq}\) , we put \({x}^u\) in the rare concepts set, as shown in Steps 13 and 14. A low coverage score (less than \({t}_{Cove}\) ) indicates that the visual concept present in \({x}^u\) exists only in a few images of \({X}^l\) , and thus is rare. A uniqueness score of at least \({t}_{Uniq}\) guarantees that the closest visual concept belongs only to one class, i.e., rare or common class. Finally, the domain expert labels the images in the four sets and moves the images from \({X}^u\) to \({X}^l\) for the next iteration of active learning. VisActive can be generalized to more than two classes by creating XCS, CCS, UCS, and RCS for each class. Note that the existing DAL methods also require either the threshold value for uncertainty or the total number of images to recommend ( \(4\ \times \delta\) value in our case) [14, 23, 26, 36, 37].
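The selection rules of Algorithm 2 for the binary case can be sketched as follows; the container layout, function name, and default thresholds mirror the description above but are assumptions, not the released implementation.

```python
def recommend(images, hsc, closest_concepts, coverage, uniqueness,
              concept_class, sim_of_closest, delta,
              t_clos=6, t_cove=0.2, t_uniq=1.0, t_sim=0.5):
    """Sketch of Algorithm 2.  `hsc[x]` is the highest of the 2m similarity
    scores of image x, `closest_concepts[x]` lists x's concepts from most to
    least similar, `sim_of_closest[x]` is the similarity of the single closest
    concept, and `concept_class[k]` is the class of concept k."""
    xcs, ccs, ucs, rcs = [], [], [], []
    for x in sorted(images, key=lambda i: hsc[i], reverse=True):
        vc = closest_concepts[x][0]                 # single closest concept
        top = closest_concepts[x][:t_clos]
        rare_share = sum(concept_class[k] == 'rare' for k in top) / len(top)
        if rare_share == 0.5 and len(xcs) < delta:  # cross-concepts set
            xcs.append(x)
        elif coverage[vc] > t_cove and sim_of_closest[x] >= t_sim and len(ccs) < delta:
            ccs.append(x)                           # common-concepts set
        elif sim_of_closest[x] < t_sim and len(ucs) < delta:
            ucs.append(x)                           # unseen-concepts set
        elif coverage[vc] < t_cove and uniqueness[vc] >= t_uniq and len(rcs) < delta:
            rcs.append(x)                           # rare-concepts set
    return xcs, ccs, ucs, rcs
```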

4 Experimental Results

We conducted two sets of experiments to evaluate VisActive against several recent DAL methods. One set is for binary classification with one rare and one common class. The common class has more images and is much more diverse than the rare class. The other set is for multi-class classification with one rare class and multiple common classes. We describe the datasets, the comparison methods, and the experimental setup. We then report the results and discuss four ablation studies. For all the experiments, we used a ResNet18 implementation [8, 43] to obtain image classification results, as done in [23, 36, 37, 44].

4.1 Active Learning for Binary Classification

4.1.1 Datasets.

We used two medical datasets (Kvasir V2 [45] and the Forceps [46] dataset) and one non-medical dataset (Caltech-256 [47]). The Forceps dataset is an unbalanced dataset without any patient identifiable information and represents a real-world imbalanced dataset. This dataset and Caltech-256 were used in [14]. The Forceps dataset includes frames sampled from 228 full-length deidentified colonoscopy videos covering 95 endoscopy hours. The videos have a frame rate of 29.97 frames per second at a resolution of 720 × 480 pixels. The rare class of the Forceps dataset has images that contain forceps instruments, while the common class has images without the forceps instrument. Table 3 shows the number of images of the rare and common classes for each dataset. The Caltech-256 and Kvasir datasets are open datasets used for the repeatability of experiments. However, they are class balanced and have more than two classes. We synthesized one imbalanced dataset from each of them. We selected one of the classes of the original dataset as the rare class and assigned all the images of the other classes in that dataset to the common class of the synthesized dataset, as done in [23]. This is to increase the common class size and diversity. The rare class of the Kvasir dataset is the polyp class. The rare class of the Caltech-256 dataset is the airplane class, as in [14]. We denote the imbalanced datasets as Kvasir{Polyp}, Forceps{Forceps}, and Caltech{plane}, where the superscript indicates the rare class of the dataset.
Table 3. Descriptive Statistics of the Entire Datasets

Dataset | No. of Rare Class Images | No. of Common Class Images | Class Imbalance Ratio
Kvasir{Polyp} | 1,000 | 7,000 | 1:7
Forceps{Forceps} | 6,860 | 303,557 | 1:44
Caltech{plane} | 800 | 29,808 | 1:37
For each of these datasets, we randomly selected 15% of the images of the rare and common classes for testing the image classifier. This is to get the testing sets to have approximately the same imbalance distribution as the training dataset. The rest were used for training. The training and testing sets are mutually exclusive. This is to satisfy the Independently and Identically Distributed Data assumption for training and testing data. We used the testing dataset for validation. We created a small balanced seed training dataset by random sampling without replacement 10% of the training images of the rare class. We randomly selected the same number of images from the common class. The remaining images of the training dataset were available for recommendation by the methods under study. The balanced seed dataset is recommended so that the model does not learn features that favor the common class. Figure 3 shows example images of the rare class and common class of Kvasir{Polyp} and Forceps{Forceps}.
Fig. 3.
Fig. 3. Example images of the rare class (top row) and common class (bottom row) of Forceps{Forceps} (a) and Kvasir{Polyp} (b).
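A simple sketch of this data preparation, assuming in-memory image lists, is shown below; the function and variable names are illustrative, not part of the released code.

```python
import numpy as np

def make_splits(rare, common, test_frac=0.15, seed_frac=0.10, rng=None):
    """Sketch of the data preparation in Section 4.1.1: a stratified 15% test
    split, a small balanced seed set (10% of the rare-class training images
    plus an equal number of common-class images), and the remaining training
    images as the unlabeled pool."""
    rng = rng or np.random.default_rng(0)

    def split(images, n_first):
        idx = rng.permutation(len(images))
        first = [images[i] for i in idx[:n_first]]
        rest = [images[i] for i in idx[n_first:]]
        return first, rest

    # Stratified test split so the test set keeps the class-imbalance ratio.
    rare_test, rare_train = split(rare, int(test_frac * len(rare)))
    common_test, common_train = split(common, int(test_frac * len(common)))

    # Balanced seed: 10% of the rare training images plus as many common images.
    rare_seed, rare_pool = split(rare_train, int(seed_frac * len(rare_train)))
    common_seed, common_pool = split(common_train, len(rare_seed))

    seed = [(x, 'rare') for x in rare_seed] + [(x, 'common') for x in common_seed]
    unlabeled = rare_pool + common_pool          # available for recommendation
    test = [(x, 'rare') for x in rare_test] + [(x, 'common') for x in common_test]
    return seed, unlabeled, test
```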

4.1.2 Comparison Methods and Experimental Setup.

We compared VisActive against the state-of-the-art DAL methods designed for class imbalance: SAL (our prior work) [14], TA-VAAL [37], and PT4AL [29]. We selected these methods based on the availability of the source code and the best reported results in their category: PT4AL for the uncertainty approach, SAL for the feature-based approach, and TA-VAAL for the variational autoencoder approach. As a reference, we also provide results for the methods that were not designed to handle class imbalance: AIFT [28] and CEAL [26]. We kept the default values in the original code of these existing methods but set the total number of recommended images for each method to be the same. For SAL, we recommended images based on the feature similarity and did not perform pseudo-labeling. This is to maintain the same number of recommended images per method at each iteration, and thus ensure fairness of comparison. We call this modification SAL*. We implemented a baseline method (Random) that recommends images for labeling using random sampling without replacement. We measured the benchmark performance when training was performed on all the images in the training set. We expected the benchmark model to show the best classification performance among all models since training on the entire training dataset is expected to obtain the best accuracy.
We implemented VisActive using Python 3.7 and TensorFlow version 2.5 [48]. For the CL model shown in Algorithm 1, we set the number of visual concepts per class \(m\) to 10 as recommended by [21]. For the VC layer, the values of \(\lambda\) , \(h\) , and \(w\) were 0.7, 1, and 1, respectively, as used in the original paper. We set the CL loss function multiplier \({{\rm{\Gamma }}}_e\) to 1, \({{\rm{\Gamma }}}_d\) to 0.1, and \({{\rm{\Gamma }}}_c\) to 0.8 following [21]. We used a stochastic gradient descent optimizer with a learning rate of 0.001 and a momentum of 0.9. We empirically set the number of epochs of the CL training algorithm \(e1\) to 20, \(e2\) to 20, and \(e3\) to 60. These values were sufficient for the CL model to learn the important visual concepts from the training data. Also, the number of epochs depends on the training dataset size and the CL model complexity. For the parameters of VisActive shown in Algorithm 2, we set the coverage score threshold \({t}_{Cove}\) to 0.2, the uniqueness score threshold \({t}_{Uniq}\) to 1, the number of closest visual concepts \({t}_{clos}\) to 6, and the similarity score threshold \({t}_{sim}\) to 0.5. These values were determined empirically. See Section 4.4 for an ablation study of the various components of VisActive. For each iteration of SAL* and VisActive, we set the number of recommended images per class to the number of rare class images in the initial dataset. Hence, \(4\delta\) images were recommended per class in one iteration. Each of the remaining methods that do not recommend the same number of images per class recommended \(| C | \times \ 4\delta\) images per iteration, where \(| C |\) is the number of classes. For the image classifier, we trained ResNet18 for 200 epochs with a batch size of 128. We used stochastic gradient descent as an optimizer with a learning rate of 0.1 decreased to 0.01 after 160 epochs, and momentum of 0.9.
For SAL*, AIFT, CEAL, TA-VAAL, and PT4AL, we used the authors’ implementation with the same hyperparameter values as listed in the original papers. We used the same initial dataset for all the methods. The same unlabeled image dataset was available to all the methods for recommendation. We also used the same hyperparameter settings for all the models when training the image classifiers. Each point for each method reported in Figure 4 is the mean macro F1-scores of five ResNet classifiers, each trained on the same training dataset and tested on the test dataset mentioned previously. Except for the initial training dataset, the different methods created different training datasets in each iteration, although they have the same number of images. Each sub-figure of Figure 4 involves at least 210 runs, each consisting of training followed by testing of ResNet18. We used macro F1-scores as done in [14, 49] to avoid favoring the common class. We did not use classification accuracy as it is a biased performance metric under class imbalance [19]. We stopped each method after six iterations.
Fig. 4.
Fig. 4. Classification performance evaluation of VisActive in comparison to SAL*, CEAL, AIFT, TA-VAAL, PT4AL, and Random methods trained on (a) Kvasir{Polyp}, (b) Forceps{Forceps}, and (c) Caltech{Plane} datasets. The horizontal black dashed line represents the benchmark classification performance when using all images in the training set.
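As a side note, the macro F1-score can be computed with scikit-learn as in the toy example below (this is not the authors' evaluation script); it weights both classes equally, unlike accuracy, which is dominated by the common class.

```python
from sklearn.metrics import f1_score

# Macro F1 averages the per-class F1-scores with equal weight, so the rare
# class counts as much as the common class.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]   # toy labels: 0 = common, 1 = rare
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
print(f1_score(y_true, y_pred, average='macro'))   # ~0.69, despite 80% accuracy
```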

4.1.3 Experimental Results.

We report the macro F1-scores and the recall of the rare class.

4.1.3.1 Classification Performance.

Figure 4 illustrates the macro F1-score at different iterations on the three datasets. The performance at iteration 0 was obtained from training the ResNet18 classifier with the seed training dataset, which is class balanced. The initial dataset is typically small, incurring a relatively small amount of manual labeling effort for the benefit of obtaining an accurate classifier. The horizontal black dashed line represents the benchmark performance, the best performance when training the classifier with the entire training dataset.
On Kvasir{Polyp}, the dataset with a mild imbalance ratio of 1:7, Figure 4(a) shows that VisActive and PT4AL produce the two best-performing ResNet18 classifiers. The two methods take turns winning in different iterations. The macro F1-score of the last iteration when using images recommended by VisActive matches those obtained by SAL*, PT4AL, and CEAL. However, on the datasets with very high imbalance ratios (over 1:37), Forceps{Forceps} and Caltech{plane}, VisActive recommends better training images, resulting in a faster increase in ResNet18 classification performance than the rest of the compared DAL methods, as shown in Figures 4(b) and 4(c). At the fifth and sixth iterations in Figure 4(c), the training data recommended by VisActive results in the same classification performance as using the entire training set. PT4AL ranks second best. TA-VAAL comes in third, followed by SAL*. Note that the common class of Caltech{plane} has a large variety, consisting of images sampled from 255 classes of the original Caltech-256. The visual concepts by VisActive can handle a large variety of patterns in the common class. On Forceps{Forceps}, VisActive outperforms the rest. SAL* obtains the second-best performance, followed by PT4AL and then TA-VAAL.

4.1.3.2 Recall of the Rare Class.

The recall of the rare class is the ratio of the number of correctly labeled images in the rare class to the total number of images in the rare class in the entire unlabeled dataset. SAL* and VisActive were set to recommend 10% of the rare class images and the same number of images for the common class at each iteration. Thus, after six iterations, the maximum rare class recall is 70% if all the recommended rare images are indeed rare class images.
Figure 5 shows the comparison of the rare class recalls (in percent). On Kvasir{Polyp}, VisActive obtains the highest rare class recall of 46%. It comes in second at 40% recall on Forceps{Forceps} and 62% recall on Caltech{Plane}. Trained on Caltech{Plane}, VisActive performs only 1% below SAL*. VisActive recommends images based on diverse image characteristics in its four candidate image sets, which increases the diversity of the labeled dataset. This in turn enhances the decision boundary of the trained image classifier, resulting in improved classification performance, as shown in Figure 4.
Fig. 5.
Fig. 5. The rare-class recall (in percent) after six iterations.

4.2 Active Learning with Multi-class Classification

For multi-class classification with multiple common classes, we randomly removed 90% of the images of the 1st, 5th, and 10th classes from the training dataset of CIFAR-10 and denote these datasets as CIFAR-10{-1}, CIFAR-10{-5}, and CIFAR-10{-10}, respectively, as done in [23]. The initial labeled dataset has 1,000 images selected per the method described in [23], resulting in an imbalanced initial dataset. We evaluated VisActive against TA-VAAL [37], VaB-AL [23], and PT4AL [29], since they are designed for multi-class classification. We used the authors' implementations with the same hyperparameter values as listed in the original papers. For VisActive, we used the same hyperparameter settings as described in Section 4.1. We randomly selected 100 images per class for the initial dataset. We used the same initial dataset for all the methods. For each active learning method, we set the number of recommended images per iteration to 1,000 as done in [23, 29]. We trained and tested the ResNet18 classifier on each dataset five times and reported the mean accuracy on the balanced CIFAR-10 testing dataset as done in [23, 36, 37].
Figure 6 illustrates that VisActive recommends better training data that results in the best classification performance in all iterations. VisActive achieves a mean accuracy of 0.80 with 6,000 labeled images, 13% of the size of the benchmark training dataset. VisActive performs well with the imbalanced initial training data. PT4AL results are different from the ones reported in [29] since PT4AL has its own method of selecting the initial labeled dataset. For a fair comparison in this study, we used the same initial labeled dataset for all the compared methods. With 6,000 labeled images, the classifier using the training data recommended by PT4AL offers the lowest accuracy at 0.68. VaB-AL has a slow start at first, even lower than PT4AL at 2,000 labeled images, but is able to surpass PT4AL and finally catch up with TA-VAAL.
Fig. 6.
Fig. 6. Classification mean accuracy when trained on CIFAR-10{-1}, CIFAR-10{-5}, and CIFAR-10{-10}. The horizontal black dashed line represents the benchmark classification performance when using all images in the training set.

4.3 Ablation Studies

We performed four ablation studies to gain insights into VisActive. The first two studies are to understand the effectiveness of the four recommended sets. The last two studies investigate how important the learned visual concepts are to classification performance. We have seen the importance of these concepts implicitly since VisActive is able to find diverse training data to improve classification performance better than other comparison methods, especially when the unlabeled dataset is highly imbalanced. We used the experimental setup as in Section 4.1.2 with the balanced seed training dataset.

4.3.1 Rare Image Distribution across Recommended Sets.

The goal of this study is to see the distribution of the rare class images across the four sets of images recommended by VisActive. Table 4 shows the percentage of correctly recommended rare-class images for each set after the sixth iteration to the total number of correctly recommended rare-class images. VisActive was set to recommend 10% of the rare class images at each iteration. Thus, after six iterations, the maximum proportion of the rare class images is 70% of the total number of rare images in the training dataset. We have three observations: (1) CCS of the rare class is the largest among the four recommended sets for the Kvasir{Polyp} and Forceps{Forceps} datasets. Common concepts are more prevalent. Therefore, it is expected that VisActive finds more unlabeled images with common concepts of the rare class. On the other hand, RCS is the largest for Caltech{Plane}. The airplane class has a variety of airplane images with different colors, shapes, poses, and locations, i.e., on the ground or in the air. A few of these images are in the initial labeled dataset, but more are in the unlabeled dataset. (2) XCS is larger for Kvasir{Polyp} and Forceps{Forceps} than that of Caltech{Plane}. It is more difficult to discriminate between the rare and the common classes for medical images than the non-medical images. For instance, the color of polyps in Kvasir{Polyp} is in different shades of red, similar to another anatomical structure of the colon, whereas the color of different airplane images in Caltech{Plane} can be very different. (3) For Caltech{Plane}, UCS is slightly larger than XCS. This is due to the diversity of the visual concepts found in the rare-class images in this dataset.
Table 4. Rare-Image Distribution across the Four Sets Recommended by VisActive

Dataset | CCS | RCS | XCS | UCS
Kvasir{Polyp} | 35% | 31% | 20% | 14%
Forceps{Forceps} | 42% | 33% | 16% | 9%
Caltech{Plane} | 33% | 51% | 7% | 9%

4.3.2 Impact of Individual VisActive's Candidate Sets.

We analyzed the impact of individual candidate image sets recommended by VisActive based on the ResNet18 macro F1-scores. We introduced four variants of VisActive and denote them as follows: VisActive{+CCS}, VisActive{+RCS}, VisActive{+UCS}, and VisActive{+XCS}. The plus superscript indicates a specific candidate set. For instance, VisActive{+RCS} recommends only the rare concepts set. We set the number of recommended images of a given candidate set to the total number of rare class images in the initial labeled dataset. The benchmark denotes the original VisActive that recommends all four image sets.
Figure 7 shows the average macro F1-score of the ResNet18 classifier trained on images recommended by VisActive and its variants on the Kvasir{Polyp} dataset. Hyperparameter settings were the same as described in Section 4.1.2. Among the variants, the image classifier trained on images recommended by VisActive{+XCS} offers the best F1-score in almost all the iterations. Note that on the second and third iterations, the F1-scores are the same for VisActive{+XCS} and VisActive. This shows the importance of recommendations with the concepts shared in both classes. As a result, we have a better classifier to correctly classify images that fall closer to the classification boundary between the two classes. Among the variants, VisActive{+UCS} results in the second-best F1-score at the fourth and fifth iterations of the labeled images and the best F1-score at the sixth iteration when 70% of the rare class images have been labeled. UCS includes images with unseen visual concepts, i.e., the concepts that have not been included in the training dataset. Recommending these images for labeling and training enhances the classification performance. With the recommended images by VisActive{+CCS}, the image classifier yields the lowest F1-score in comparison with the other VisActive variants. The F1-score increases after the third iteration. After the third iteration, CCS and RCS maintain a similar proportion of images with common and rare concepts.
Fig. 7.
Fig. 7. Average macro F1-scores of the classifier trained on images recommended by VisActive variants on Kvasir{Polyp}.
Figure 8 reports the average recall of the rare class when using each of the VisActive variants for recommendation on Kvasir{Polyp}. We measured the recall of the rare class at each iteration and averaged them over six iterations. Figure 8 illustrates that VisActive{+CCS} has the highest recall of the rare class at 51%. This is expected as common visual concepts are more prevalent. VisActive{+RCS} results in the second-highest recall of the rare class at 39%. This is due to the large number of images with varieties of rare visual concepts. VisActive{+UCS} comes in third with 36%, while VisActive{+XCS} obtains the least recall of 26%.
Fig. 8.
Fig. 8. The average rare class recall across six iterations of recommendation by the VisActive variants on Kvasir{Polyp}.

4.3.3 Faithfulness of Visual Concepts.

We investigated whether the generated visual concepts of the concept learner CL are indeed generalized feature representations of a class. If an image patch closest to one of the learned visual concepts of a class is important for a correct classification of the image to that class, zeroing out all the pixels corresponding to this image patch should cause a drop in classification confidence when the perturbed image is presented to the image classifier. We define the faithfulness score as the percentage of training images where the classification confidence drops after some perturbation, as defined in Equation (8). Our definition of faithfulness score is similar to those in [50, 51] and represents whether the image patches closest to the visual concepts of the class are utilized in the class prediction. The higher the faithfulness score, the more the zeroed-out pixels affect the classification decision.
\begin{equation} Faithfulness\ Score = \ \frac{{\mathop \sum \nolimits_{x \in X} {\mathbb{I}}_x}}{{\left| X \right|}} \times \ 100, \end{equation}
(8)
where \({\mathbb{I}}_x\) is an indicator function that evaluates to 1 if the classification confidence of image \(x\) drops after some perturbation, and 0 otherwise. \(X\) is a set of images and \(| X |\) is the total number of images in the set.
We also define the relative confidence drop, similar to [52], as the average drop in classification confidence after some perturbation, as defined in Equation (9). Its value ranges from 0 to 1. The larger the relative confidence drop, the better the visual concepts generalize the important image features of the class, and the more important they are to the classification performance.
\begin{equation} Relative\ Confidence\ Drop = \frac{1}{\left| X \right|} \sum_{x \in X} \max\left( 0, y_x - {\hat{y}}_x \right), \tag{9} \end{equation}
where \({y}_x\) is the confidence score of the correctly classified image \(x\) and \({\hat{y}}_x\) is the confidence score of \(x\) for the same predicted class after perturbation.
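Under the same assumptions as above (pre-collected confidences; illustrative names), Equation (9) reduces to a one-line computation:

import numpy as np

def relative_confidence_drop(original_conf, perturbed_conf):
    # Equation (9): mean of max(0, y_x - y_hat_x) over the image set X.
    original_conf = np.asarray(original_conf)
    perturbed_conf = np.asarray(perturbed_conf)
    return float(np.maximum(0.0, original_conf - perturbed_conf).mean())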
We experimented with two perturbation methods for each image: CL-Perturb and Random-Perturb. CL-Perturb zeroes out the pixels of the image patches closest to the learned visual concepts. Random-Perturb zeroes out randomly chosen image patches. For each training image, both methods perturb the same number of image patches, namely the number of image patches closest to the visual concepts of the predicted class for that image. We ran Random-Perturb five times and report the average, maximum, and minimum faithfulness scores, denoted as Random-Perturb (Avg), Random-Perturb (Max), and Random-Perturb (Min), respectively, in Table 5.
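The following sketch illustrates the patch zeroing shared by the two perturbation methods. It assumes images are H×W×C NumPy arrays and that the patches closest to the visual concepts are available as bounding boxes from the concept learner; the helper names (zero_out_patches, random_patch_boxes) and the fixed patch geometry are illustrative assumptions, not the exact implementation.

import numpy as np

def zero_out_patches(image, patch_boxes):
    # Zero out the pixels of each patch; a box is (row, col, height, width).
    perturbed = image.copy()
    for r, c, h, w in patch_boxes:
        perturbed[r:r + h, c:c + w, :] = 0
    return perturbed

def random_patch_boxes(image_shape, num_patches, patch_h, patch_w, rng):
    # Random-Perturb: sample the same number of patches at random locations.
    H, W = image_shape[:2]
    return [(int(rng.integers(0, H - patch_h)), int(rng.integers(0, W - patch_w)),
             patch_h, patch_w) for _ in range(num_patches)]

# CL-Perturb: patch_boxes come from the image patches closest to the learned
# visual concepts of the predicted class (produced by the concept learner).
# Random-Perturb: random_patch_boxes(image.shape, len(cl_boxes), h, w,
#                                    np.random.default_rng(seed)), repeated five times.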
Table 5. Faithfulness Score and Relative Confidence Drop of the Visual Concepts with ResNet18 Trained on Kvasir{Polyp}

Perturbation Method      | Faithfulness Score | Relative Confidence Drop
CL-Perturb               | 93%                | 0.31
Random-Perturb (Avg)     | 68%                | 0.16
Random-Perturb (Max)     | 74%                | 0.19
Random-Perturb (Min)     | 59%                | 0.14
With CL-Perturb, the faithfulness score of the ResNet18 classifier on the Kvasir{Polyp} dataset is 93%. This high faithfulness score shows the importance of the learned visual concepts for the classification decision made by the classifier. With Random-Perturb, the average faithfulness score is 68%, a reduction of 25 percentage points compared to CL-Perturb. Hence, the visual concepts learned by the concept learner are much more relevant to the classification decision of the ResNet18 classifier than randomly chosen image patches. Table 5 also shows that CL-Perturb has the highest relative confidence drop at 0.31, almost double the average drop of Random-Perturb at 0.16.

4.3.4 Exploring Uniqueness and Coverage Scores of Visual Concepts.

Table 6 shows the maximum uniqueness score and the maximum coverage score of the visual concepts of each class learned by the concept learner trained on Kvasir{Polyp}. For each visual concept, we calculated the uniqueness and coverage scores using Equations (6) and (7), respectively, and then took the maximum of each score per class. A high maximum uniqueness score indicates that a visual concept of the target class represents only the features of images that belong to that class. Similarly, a high maximum coverage score per class is desirable, as it indicates that some visual concepts of the class cover a large number of training images of that class. Table 6 shows a maximum uniqueness score of 1 for both the rare and common classes, indicating that the visual concepts excel at learning generalized image features specific to their corresponding class. The maximum coverage scores of the rare and common classes reflect that the visual concepts cover many images of their class.
Table 6. Maximum Uniqueness and Coverage Scores of the Visual Concepts on Kvasir{Polyp}

                         | Rare Class | Common Class
Maximum Uniqueness       | 1          | 1
Maximum Coverage         | 0.65       | 0.78
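A minimal sketch of how the per-class maxima in Table 6 can be obtained, assuming the per-concept uniqueness and coverage scores (Equations (6) and (7)) have already been computed; the data layout and the name per_class_maxima are illustrative assumptions.

def per_class_maxima(concept_scores):
    # concept_scores: dict mapping class name -> list of (uniqueness, coverage)
    # pairs, one pair per learned visual concept of that class.
    return {cls: (max(u for u, _ in scores), max(c for _, c in scores))
            for cls, scores in concept_scores.items()}

# Illustrative values consistent with Table 6:
# per_class_maxima({"rare": [(1.0, 0.65), (0.8, 0.40)],
#                   "common": [(1.0, 0.78), (0.9, 0.52)]})
# -> {"rare": (1.0, 0.65), "common": (1.0, 0.78)}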

5 Conclusion and Future Work

We propose VisActive, a deep active learning method for classification tasks under class imbalance. VisActive outperforms the state-of-the-art active learning methods in F1-score given a balanced seed training dataset and is second best in recall for the rare class. Given an imbalanced seed dataset with few labeled rare class images to learn from, VisActive still performs well. The training algorithm of the concept learner component of VisActive is effective, and VisActive enables fine-grained control over the concepts to recommend. Our ablation study shows that recommending images with cross-concepts improves classification performance in early iterations. Once the training data are larger, recommending images with concepts not already in the training data yields further improvement since it adds diversity to the labeled dataset. After a certain point, maintaining a similar proportion of images with common and rare concepts is beneficial. A future extension is to learn the visual concepts of the unlabeled images to enhance the diversity of the labeled dataset and improve classification performance.

Acknowledgments

Tavanapong, Wong, and Oh have equity interest and management roles in EndoMetric Corp. Dr. de Groen serves on the Scientific Advisory Board of EndoMetric Corp. Findings, opinions, and conclusions expressed in this article do not necessarily reflect the view of the funding agency.

References

[1]
Joshua Raj, Jeya Shobana, Irina Valeryevna Pustokhina, Denis Alexandrovich Pustokhin, Deepak Gupta, and K. Shankar. 2020. Optimal feature selection-based medical image classification using deep learning model in internet of medical things. IEEE Access 8 (2020), 58006–58017.
[2]
Weibin Wang, Dong Liang, Qingqing Chen, Yutaro Iwamoto, Xian-Hua Han, Qiaowei Zhang, Hongjie Hu, Lanfen Lin, and Yen-Wei Chen. 2020. Medical image classification using deep learning. In Deep Learning in Healthcare: Paradigms and Applications, Y. W. Chen and L. Jain (Eds.). Springer, 33–51. https://link.springer.com/chapter/10.1007/978-3-030-32606-7_3#citeas
[3]
Pengzhen Ren, Yun Xiao, Xiaojun Chang, Po-Yao Huang, Zhihui Li, Brij B. Gupta, Xiaojiang Chen, and Xin Wang. 2021. A survey of deep active learning. ACM Computing Surveys 54, 9 (2021), 1–40.
[4]
Byungjae Lee and Kyunghyun Paeng. 2018. A robust and effective approach towards accurate metastasis detection and pN-stage classification in breast cancer. In Proceedings of International Conference on Medical Image Computing and Computer Assisted Intervention. 841–850.
[5]
Ju Gang Nam, Sunggyun Park, Eui Jin Hwang, Jong Hyuk Lee, Kwang-Nam Jin, Kun Young Lim, Thienkai Huy Vu, Jae Ho Sohn, Sangheum Hwang, Jin Mo Goo, and Chang Min Park. 2018. Development and validation of deep learning-based automatic detection algorithm for malignant pulmonary nodules on chest radiographs. Radiology. 290, 1 (2018), 218–228.
[6]
B. Settles. 2012. Active Learning. Morgan & Claypool Publishers.
[7]
Maria-Florina Balcan, Alina Beygelzimer, and John Langford. 2009. Agnostic active learning. Journal of Computational System Science 75 (2009), 78–89.
[8]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[9]
Alexander Freytag, Erik Rodner, and Joachim Denzler. 2014. Selecting influential examples: Active learning with expected model output changes. In Proceedings of European Conference on Computer Vision. 562–577.
[10]
Xuefeng Du, Dexing Zhong, and Huikai Shao. 2019. Building an active palmprint recognition system. In Proceedings of IEEE International Conference on Image Processing. 1685–1689.
[11]
Ian H. Witten, Eibe Frank, Mark A. Hall, and Christopher J. Pal. 2017. Data Mining: Practical Machine Learning Tools and Techniques (4th ed.). Morgan Kaufmann/Elsevier.
[12]
Denis Gudovskiy, Alec Hodgkinson, Takuya Yamaguchi, and Sotaro Tsukizawa. 2020. Deep active learning for biased datasets via Fisher kernel self-supervision. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 9041–9049.
[13]
Yue Huang, Zhenwei Liu, Minghui Jiang, Xian Yu, and Xinghao Ding. 2020. Cost-effective vehicle type recognition in surveillance images with deep active learning and web data. IEEE Transactions on Intelligent Transportation Systems 21, 1 (2020), 79–86.
[14]
Chuanhai Zhang, Wallapak Tavanapong, Gavin Kijkul, Johnny Wong, Piet C. de Groen, and JungHwan Oh. 2018. Similarity-based active learning for image classification under class imbalance. In Proceedings of IEEE International Conference on Data Mining. 1422–1427.
[15]
Qing-Yan Yin, Jiang-She Zhang, Chun-Xia Zhang, and Sheng-Cai Liu. 2013. An empirical study on the performance of cost-sensitive boosting algorithms with different levels of class imbalance. Mathematical Problems in Engineering 2013 (2013).
[16]
Haibo He and Edwardo A. Garcia. 2009. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21, 9 (2009), 1263–1284.
[17]
Vaishali Ganganwar. 2012. An overview of classification algorithms for imbalanced datasets. International Journal of Emerging Technology and Advanced Engineering 2, 4 (2012), 42–47.
[18]
Paula Branco, Luis Torgo, and Rita P. Ribeiro. 2016. A survey of predictive modeling on imbalanced domains. ACM Computing Surveys 49 (2016), 1–50.
[19]
Justin Johnson and Taghi Khoshgoftaar. 2019. Survey on deep learning with class imbalance. Journal of Big Data 6, 1 (2019), 1–54.
[20]
Gregory Cooper, Amitabh Chak, and Siran Koroukian. 2005. The polyp detection rate of colonoscopy: A national study of Medicare beneficiaries. American Journal of Medicine 118, 12 (2005), 1413–1424.
[21]
Mohammed Khaleel, Wallapak Tavanapong, Johnny Wong, Junghwan Oh, and Piet De Groen. 2021. Hierarchical visual concept interpretation for medical image classification. In Proceedings of IEEE International Symposium on Computer-Based Medical Systems. 25–30.
[22]
American Society for Gastrointestinal Endoscopy. 2015. Quality indicators for GI endoscopic procedures. Gastrointestinal Endoscopy 81, 1 (2015), 31–53.
[23]
Jongwon Choi, Kwang Moo Yi, Jihoon Kim, Jinho Choo, Byoungjip Kim, Jinyeop Chang, Youngjune Gwon, and Hyung Jin Chang. 2021. VaB-AL: Incorporating class imbalance and difficulty with variational Bayes for active learning. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 6749–6758.
[24]
Ajay Joshi, Fatih Porikli, and Nikolaos Papanikolopoulos. 2009. Multi-class active learning for image classification. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2372–2379.
[25]
William Beluch, Tim Genewein, Andreas Nurnberger, and Jan M. Kohler. 2018. The power of ensembles for active learning in image classification. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 9368–9377.
[26]
Keze Wang, Dongyu Zhang, Ya Li, Ruimao Zhang, and Liang Lin. 2017. Cost-effective active learning for deep image classification. IEEE Transactions on Circuits and Systems for Video Technology 27, 12 (2017), 2591–2600.
[27]
Ozan Sener and Silvio Savarese. 2018. Active learning for convolutional neural networks: A core-set approach. In Proceedings of International Conference on Learning Representations. 1–13.
[28]
Zongwei Zhou, Jae Shin, Lei Zhang, Suryakanth Gurudu, Michael Gotway, and Jianming Liang. 2017. Fine-tuning convolutional neural networks for biomedical image analysis: Actively and incrementally. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 7340–7351.
[29]
John Seon Keun Yi, Minseok Seo, Jongchan Park, and Dong-Geol Choi. 2022. Using self-supervised pretext tasks for active learning. In Computer Vision (ECCV’22), S. Avidan, G. Brostow, M. Cissé, G. M. Farinella, T. Hassner (Eds.). Lecture Notes in Computer Science, Springer, Cham. 13686.
[30]
Donggeun Yoo and In So Kweon. 2019. Learning loss for active learning. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition. 93–102.
[31]
Yarin Gal, Riashat Islam, and Zoubin Ghahramani. 2017. Deep Bayesian active learning with image data. In Proceedings of International Conference on Machine Learning. 1183–1192.
[32]
Karl Weiss, Taghi M. Khoshgoftaar, and DingDing Wang. 2016. A survey of transfer learning. Journal of Big Data 3, 1 (2016), 1–40.
[33]
Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, 1 (2014), 1929–1958.
[34]
Reza Zanjirani Farahani and Masoud Hekmatfar. 2009. Facility Location: Concepts, Models, Algorithms and Case Studies. Springer Science and Business Media.
[35]
Ofer Yehuda, Avihu Dekel, Guy Hacohen, and Daphna Weinshall. 2022. Active learning through a covering lens. In Proceedings of Conference on Neural Information Processing Systems. 1–19.
[36]
Samarth Sinha, Sayna Ebrahimi, and Trevor Darrell. 2019. Variational adversarial active learning. In Proceedings of IEEE International Conference on Computer Vision. 5972–5981.
[37]
Kwanyoung Kim, Dongwon Park, Kwang Kim, and Se Young Chun. 2021. Task-aware variational adversarial active learning. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 8166–8175.
[38]
Shie Mannor, Reuven Rubinstein, and Yohai Gat. 2005. The cross-entropy method for classification. In Proceedings of International Conference on Machine Learning. 561–568.
[39]
Chaofan Chen, Oscar Li, Daniel Tao, Alina Barnett, Cynthia Rudin, and Jonathan K. Su. 2019. This looks like that: Deep learning for interpretable image recognition. In Proceedings of Conference on Neural Information Processing Systems. 8928–8939.
[40]
Chuanqi Tan, Fuchun Sun, Tao Kong, Wenchang Zhang, Chao Yang, and Chunfang Liu. 2018. A survey on deep transfer learning. In Proceedings of International Conference on Artificial Neural Networks. 270–279.
[41]
Yarin Gal and Zoubin Ghahramani. 2015. Bayesian convolutional neural networks with Bernoulli approximate variational inference. arXiv preprint.
[42]
Sinno Jialin Pan and Qiang Yang. 2009. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22, 10 (2009), 1345–1359.
[43]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of IEEE International Conference on Computer Vision. 1026–1034.
[44]
Azeez Idris. Website. https://github.com/azibit/pytorch-cifar, accessed in November 2021.
[45]
Konstantin Pogorelov, Kristin Ranheim Randel, Carsten Griwodz, Sigrun Losada Eskeland, Thomas de Lange, Dag Johansen, et al. 2017. Kvasir: A multi-class image dataset for computer-aided gastrointestinal disease detection. In Proceedings of ACM Multimedia Systems. 164–169.
[46]
Wallapak Tavanapong, Johnny Wong, Piet C. de Groen, and JungHwan Oh. 2021. Endoscopy Dataset, Iowa State University. https://cml.cs.iastate.edu/EMIS
[47]
Gregory Griffin, Alex Holub, and Pietro Perona. 2007. Caltech-256 object category dataset. Caltech Technical Report.
[48]
Martin Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, et al. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. arXiv preprint.
[49]
Kai Ming Ting. 2010. Precision and recall. In Encyclopedia of Machine Learning, Claude Sammut and Geoffrey I. Webb (Eds.). 781–781.
[50]
Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not explanation. In Proceedings of Conference on Empirical Methods in Natural Language Processing and International Joint Conference on Natural Language Processing. 11–20.
[51]
Bernease Herman. 2017. The promise and peril of human evaluation for model interpretability. arXiv preprint.
[52]
Mohammed Khaleel, Lei Qi, Wallapak Tavanapong, Johnny Wong, Adisak Sukul, and David Peterson. 2022. IDC: Quantitative evaluation benchmark of interpretation methods for deep text classification models. Journal of Big Data 9, 1 (2022), 1–34.
