We validate our method on three datasets, each with a different input domain and a different type of annotation. Testing the model in these varied settings verifies its usefulness across a range of applications.
4.1 Experiment 1: CelebA Dataset
The first dataset is the CelebA dataset [25]. This is a face attribute dataset containing about 200K celebrity images, each with 40 attribute annotations such as “Eyeglasses” and “Smiling.” In this experiment, we set “Attractive,” “Heavy Makeup,” “Male,” and “Young” as the attributes to be explained by the rest. These global attributes are selected as targets because they can be intuitively explained by combinations of other, more local features.
The model architecture is illustrated in Figure 2. The input is an image. We use VGG16 [38] as the base model for the performer and ResNet152 [13] for the explainer; both are pretrained on ImageNet [7].
F and f serve as the performer’s prediction heads, outputting a prediction for each attribute. The explainer’s prediction heads, g and h, which regress the weight and the factor vector for each attribute, share the same architecture (with different parameters) on top of ResNet152. The layers composing these models are listed in Table 1. The number of explanatory attributes n is 39, and the dimension of the factor vectors l is set to 2.
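To make the architecture concrete, the following is a minimal PyTorch sketch of the performer and explainer used in this experiment. It is our own reconstruction: the single linear layer per head, the feature dimensions, and the class and variable names are placeholders standing in for the layers listed in Table 1, not the exact configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

N_ATTR = 39      # number of explanatory attributes n
FACTOR_DIM = 2   # dimension of factor vectors l


class Performer(nn.Module):
    """VGG16 backbone with heads F (target attribute) and f (explanatory attributes)."""
    def __init__(self, n_targets=1, n_attr=N_ATTR):  # one target per run (e.g., "Attractive"); our assumption
        super().__init__()
        vgg = models.vgg16(weights="IMAGENET1K_V1")
        # Reuse the VGG16 feature extractor and all but the last classifier layer (4096-d output).
        self.backbone = nn.Sequential(vgg.features, vgg.avgpool, nn.Flatten(),
                                      *list(vgg.classifier.children())[:-1])
        self.head_F = nn.Linear(4096, n_targets)  # target attribute prediction
        self.head_f = nn.Linear(4096, n_attr)     # explanatory attribute predictions

    def forward(self, x):
        z = self.backbone(x)
        return self.head_F(z), self.head_f(z)


class Explainer(nn.Module):
    """ResNet152 backbone with heads g (weights) and h (factor vectors): same architecture, separate parameters."""
    def __init__(self, n_attr=N_ATTR, factor_dim=FACTOR_DIM):
        super().__init__()
        resnet = models.resnet152(weights="IMAGENET1K_V1")
        # Drop the final fully connected layer; the pooled feature is 2048-d.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1], nn.Flatten())
        self.head_g = nn.Linear(2048, n_attr)               # one weight per attribute
        self.head_h = nn.Linear(2048, n_attr * factor_dim)  # one factor vector per attribute

    def forward(self, x):
        z = self.backbone(x)
        w = self.head_g(z)                                  # (batch, n)
        v = self.head_h(z).view(-1, N_ATTR, FACTOR_DIM)     # (batch, n, l)
        return w, v
```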
We first train the performer with a cross-entropy loss and then train the explainer with the loss function of Equation (3). The hyperparameter of PriorLoss is set to 10.
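The two-stage training procedure can be sketched as follows. Here `train_loader` is a hypothetical data loader and `explanation_loss` stands in for Equation (3) (including PriorLoss), which is not reproduced here; the optimizer settings are illustrative only.

```python
import torch
import torch.nn.functional as F_loss  # aliased to avoid clashing with the head name F
from torch.optim import Adam

performer, explainer = Performer(), Explainer()
opt_p = Adam(performer.parameters(), lr=1e-4)
opt_e = Adam(explainer.parameters(), lr=1e-4)

# Stage 1: train the performer with cross-entropy on the binary attribute labels.
for images, target_labels, attr_labels in train_loader:  # hypothetical DataLoader
    target_logits, attr_logits = performer(images)
    loss = F_loss.binary_cross_entropy_with_logits(target_logits, target_labels) \
         + F_loss.binary_cross_entropy_with_logits(attr_logits, attr_labels)
    opt_p.zero_grad(); loss.backward(); opt_p.step()

# Stage 2: freeze the performer and train the explainer with the loss of Equation (3).
performer.eval()
for images, target_labels, attr_labels in train_loader:
    with torch.no_grad():
        target_logits, attr_logits = performer(images)
    w, v = explainer(images)
    loss = explanation_loss(w, v, attr_logits.sigmoid(), target_logits)  # Eq. (3), placeholder
    opt_e.zero_grad(); loss.backward(); opt_e.step()
```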
4.2 Experiment 2: DeepFashion Dataset
The second dataset is the DeepFashion dataset [24]. The DeepFashion database includes many benchmarks for various purposes; we select the Category and Attribute Prediction Benchmark because it contains rich annotations suitable for explanations. The coarse annotation covers five types of attributes: texture, fabric, shape, part, and style. Because the annotation includes as many as 1,000 attributes, we reduce the number of attributes and the amount of data. The benchmark includes many kinds of clothes (denim jacket, long skirt, T-shirt, etc.); to limit the number of attributes, we use only images of tops. The 100 most frequent attributes, for example, “Print,” “Knit,” and “Shirt,” are then selected, and the rest are discarded. As a result, the number of data points is about 140K. In this experiment, we select “Classic,” “Basic,” “Cute,” and “Soft” as the attributes to be explained by the other attributes, as they describe clothes’ global features.
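The data reduction described above can be expressed roughly as follows. The file layout and column names are hypothetical (the benchmark’s actual annotation files are organized differently); this only illustrates the two filtering steps.

```python
import pandas as pd

# Hypothetical flat table: one row per image, one binary column per attribute
# (prefixed "attr_"), plus a clothing-category column.
df = pd.read_csv("deepfashion_attributes.csv")

# Step 1: keep only upper-body clothes (tops).
tops = df[df["category_type"] == "upper-body"]

# Step 2: keep the 100 most frequent attributes and discard the rest.
attr_cols = [c for c in tops.columns if c.startswith("attr_")]
top100 = tops[attr_cols].sum().nlargest(100).index
tops = tops[["image_path", "category_type", *top100]]

print(len(tops))  # roughly 140K images remain
```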
The model architecture used in this experiment and its training process are the same as those of Experiment 1. The input is an image. The number of explanatory attributes n is 99, and the dimension of the factor vectors l is again 2.
4.3 Experiment 3: TV Ads Dataset
The last dataset is the TV ads dataset, a collection of 14,990 commercial videos that were actually broadcast on TV in Japan between January 2006 and April 2016. Each video was evaluated and annotated by 600 participants. The dataset was collected to predict the following four impression- and emotion-related effects:
• Favorability rating (F): how much participants liked the content of the advertisement itself
• Interest rating (I): how much participants became interested in the product/service
• Willingness rating (W): how much participants felt like buying the product/service
• Recognition rating (R): how much participants remembered the advertisement
Besides the videos, the dataset contains metadata such as information about the cast featured in the ad. In addition, scores are given for 26 attributes that describe the ad, such as “Good story” and “Impressive.” In the present experiment, we attempt to explain each of the four effects by these attributes.
Because the effects and the attributes are continuous values rather than binary labels, the performer’s prediction task is a regression problem, in contrast to the previous two experiments; hence, a different architecture is needed. We illustrate the model in Figure 3. The input data consist of deep features extracted from the video frames, sound, metadata, cast data, text in frames, and narration data. As the base model for both the performer and the explainer, we employ a multimodal fusion model with an attention mechanism proposed in previous research [43].
F regresses one of the four effects, and f outputs a vector of predicted attributes (the number of explanatory attributes n is 26). In contrast to the model in Figure 2, F and f output the target prediction and the explanatory-attribute predictions independently. The architecture of the explainer is otherwise similar to that in Figure 2: g and h share the base model, and their branches produce the weights and the factor vectors, respectively. We set the weight of PriorLoss in Equation (3) to 0; that is, we do not employ PriorLoss in this experiment. The dimension of the factor vectors l is again set to 2.
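A minimal sketch of the explainer for this regression variant is given below. The attention-based fusion is a generic stand-in for the multimodal fusion model of [43], and the feature dimension, modality handling, and head shapes are assumptions rather than the actual configuration.

```python
import torch
import torch.nn as nn

N_ATTR, FACTOR_DIM, FEAT_DIM = 26, 2, 512  # n, l, and an assumed per-modality feature size


class AttentionFusion(nn.Module):
    """Generic attention-weighted fusion of per-modality feature vectors (stand-in for [43])."""
    def __init__(self, dim=FEAT_DIM):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, feats):                             # feats: (batch, n_modalities, dim)
        attn = torch.softmax(self.score(feats), dim=1)    # attention weights over modalities
        return (attn * feats).sum(dim=1)                  # fused feature: (batch, dim)


class AdExplainer(nn.Module):
    """Explainer for the TV ads experiment: shared fused base with branches g and h."""
    def __init__(self):
        super().__init__()
        self.fusion = AttentionFusion()
        self.head_g = nn.Linear(FEAT_DIM, N_ATTR)               # weights, one per attribute
        self.head_h = nn.Linear(FEAT_DIM, N_ATTR * FACTOR_DIM)  # factor vectors

    def forward(self, modality_feats):
        z = self.fusion(modality_feats)
        return self.head_g(z), self.head_h(z).view(-1, N_ATTR, FACTOR_DIM)
```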
4.4 Results
We report the accuracy or correlation coefficients of each experiment in Tables 2, 4, and 6, and the conditional entropy in Tables 3, 5, and 7. In each table, the first row shows the result of the explainer in the method of Chen et al. [3], the second row shows our method’s explainer, and the third row shows the performer. We compare our method with the previous method to show that feature interaction improves the explainer’s performance. The conditional entropy of explanations proposed in the previous work [3] is not entirely appropriate for evaluating our method, because the weights of attributes and their interactions are not approximated in the same way as in the previous method.
For Experiments 1 and 2, we reimplemented the previous method to use as a baseline. There could be slight differences between our implementation and that of Chen et al. [3], since their paper omits some details of the model; nevertheless, as the first and third rows of Table 2 show, our reimplemented performer and explainer achieve almost the same performance as that reported in the paper. This implies that our implementation accurately reproduces the method of Chen et al. [3].
Table 2 shows the results of the experiment on the CelebA dataset. As mentioned above, our proposed method is compared with the interaction-free method from the literature. The table shows that, whatever the target attribute, the explainer performs better with feature interaction and attains accuracies close to those of the performer. This indicates that feature interaction can increase both explainability and the model’s discriminative power at the same time. To verify that our model using interactions can produce reasonable explanations, we display an example from the CelebA test data in Figures 4 and 5. These are explanations of why the performer judged the image to be “Attractive.” The horizontal axis is the attribute label and the vertical axis is the contribution to the prediction. Figure 4 shows an explanation produced with the method of Chen et al. [3], with the 20 largest contributions sorted in descending order. Figure 5 shows an explanation produced by our method; the top row shows the contributions of single attributes and the bottom row shows those of attribute interactions (the 20 largest of each). The previous method already achieves quantitative and semantic explanations; ours, however, considers not only single-attribute contributions but also interactions, resulting in less biased and more insightful explanations. Examining Figure 5 in more detail, the explainer suggests that attributes such as “Double chin” and “Bushy eyebrows” contribute to “Attractive” for this face image, and so do attribute interactions such as “No beard & Young” and “Male & Young.” The explanation is reasonable and easily interpretable by humans. Moreover, the contributions of feature interactions such as “No beard & Young” are larger than those of single features such as “Double chin.” This suggests that the performer relies on this feature interaction when making its prediction and that our method successfully detects it.
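As an illustration of how such a chart can be assembled from the explainer’s outputs, the following sketch computes single-attribute contributions as the attribute probability times its weight, and interaction contributions as the product of the two attributes’ probabilities and the inner product of their factor vectors, then ranks the 20 largest of each. This follows the spirit of the description above but is our own reconstruction; the exact formulation in Equation (3) may differ.

```python
import numpy as np

def top_contributions(w, v, a, names, k=20):
    """w: (n,) weights, v: (n, l) factor vectors, a: (n,) predicted attribute probabilities."""
    # Single-attribute contributions: weight times attribute probability.
    single = {names[i]: float(w[i] * a[i]) for i in range(len(a))}
    # Interaction contributions: inner product of factor vectors times both probabilities.
    pair = {f"{names[i]} & {names[j]}": float(v[i] @ v[j]) * float(a[i] * a[j])
            for i in range(len(a)) for j in range(i + 1, len(a))}
    top = lambda d: sorted(d.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return top(single), top(pair)   # the two rows of the bar chart
```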
Table 4 shows the results of the experiment on the DeepFashion dataset. As can be seen, introducing interactions into the explainer does not improve prediction performance in this domain. There are two possible reasons. The first is that the annotations are so coarse that the interactions contain more error: since an interaction is the product of the probabilities of two attributes and their factor vectors, errors tend to be amplified. The second is that the task itself is simple enough that explanation models easily converge to the optimum regardless of the choice of explanatory variables. We find that conditions such as the number of attributes and the complexity of the model are strongly related to the quality of explanation and to prediction performance, and thus need to be designed carefully. For a more detailed analysis, we present examples of explanations produced by the method of Chen et al. and by ours in Figures 6 and 7, respectively. These explanations were produced to explain why the image displayed at the top is “Classic.” According to Figure 6, for example, the second most important reason for being “Classic” is “New York,” although it is hard to determine whether the garment can be categorized as “New York.” Furthermore, Figure 7 indicates that the interactions “Collar & Button” and “Collar & Pleated” are the most significant factors, although “Button” and “Pleated” are not visible in the image. Other examples similarly contain incorrect attributes in their explanations.
Table 6 compares the prediction results of the experiment on the TV ads dataset. Unlike the previous two experiments, the results are evaluated with Pearson’s correlation coefficients, since the targets are continuous values. The explainer achieves higher performance when interactions are incorporated, except on the Favorability rating. This implies that considering interactions is valid for various tasks, including regression. Figures 8 and 9 give examples of the explanations produced by the two methods for an ad’s Favorability rating. In the interaction-free explanation, attributes such as “Familiar” and “Empathetic” are the dominant causes. The explanation with interactions likewise takes “Familiar” as one of the most important reasons; however, it differs in that the second most emphasized attribute is “Celebrity/Character,” which aligns with our intuition. Although the effects of interaction are much less pronounced here than in the other two experiments, our method still produces reasonable quantitative and semantic explanations, just as in the other cases.
Tables 3, 5, and 7 show that the conditional entropy of our explainer is almost the same as that of the explainer without interactions and that of the performer. However, as pointed out in [3], this measure is not directly related to the ground truth of explanations. We believe that higher accuracy and correlation coefficients are more important, because they indicate that the distillation from the performer is more successful.
For more experimental results, please refer to Figures 10, 11, 12, and 13 in the appendix.