Abstract
Zero-shot learning (ZSL) aims to recognize objects of novel classes without any training samples of those classes, which is achieved by exploiting semantic information and auxiliary datasets. Most recent ZSL approaches focus on learning visual-semantic embeddings to transfer knowledge from the auxiliary datasets to the novel classes. However, few works study whether the semantic information is discriminative enough for the recognition task. To tackle this problem, we propose a coupled dictionary learning approach that aligns the visual-semantic structures using the class prototypes, where the discriminative information lying in the visual space is utilized to improve the less discriminative semantic space. Zero-shot recognition can then be performed in different spaces with a simple nearest neighbor approach using the learned class prototypes. Extensive experiments on four benchmark datasets show the effectiveness of the proposed approach.
1 Introduction
Object recognition has made tremendous progress in recent years. With the emergence of large-scale image databases [28], deep learning approaches [13, 17, 29, 31] have shown great power in recognizing objects. However, such supervised learning approaches require large numbers of images to train robust recognition models and can only recognize a fixed number of categories, which limits their flexibility. Collecting large numbers of images is difficult: on one hand, the number of images per category often follows a long-tailed distribution [41], and it is hard to collect images for rare categories; on the other hand, some fine-grained annotations require expert knowledge [33], which increases the difficulty of the annotation task. These challenges motivate zero-shot learning, where no labeled examples are needed to recognize a category.
Zero-shot learning aims at recognizing objects that have not been seen in the training stage, where auxiliary datasets and semantic information are needed to perform the task. It is mainly inspired by humans' ability to recognize new objects. For example, children have no problem recognizing a zebra if they are told that a zebra looks like a horse (auxiliary dataset) but has stripes (semantic information), even if they have never seen a zebra before. Current ZSL approaches generally involve three steps. First, choose a semantic space to build up the relations between the seen (auxiliary dataset) and unseen (test) classes. Currently, the most popular semantic information includes attributes [9, 19], which are manually defined, and wordvectors [2, 10], which are automatically extracted from auxiliary text corpora. Second, learn general visual-semantic embeddings from the auxiliary dataset, where images and class semantics are projected into a common space [1, 5]. Third, perform the recognition task in the common space with different metric learning approaches.
Traditional ZSL approaches usually use fixed semantic information and pay much attention to learning more robust visual-semantic embeddings [1, 10, 15, 19, 24, 38]. However, most of these approaches ignore the fact that the semantic information, whether human-defined or automatically extracted, is incomplete and may not be discriminative enough to separate different classes, because the descriptions of classes are limited. As shown in Fig. 1, some classes may lie quite close to each other in the semantic space due to incomplete descriptions, e.g. cat and dog, so it may be less effective to perform the recognition task in this space. Since images are real reflections of the different categories, they may contain discriminative information that the descriptions fail to capture. Moreover, the semantic information is obtained independently of the visual samples, so the class structures in the visual space and the semantic space are not consistent. In such cases, the visual-semantic embeddings may be too complicated to learn. Even if the embeddings are properly learned, they are likely to overfit the seen classes and generalize poorly to the unseen classes.
To tackle these problems, we propose to learn the class prototypes by aligning the visual-semantic structures. The novelty of our framework lies in three aspects. First, different from traditional approaches that learn image embeddings, we perform structure alignment on the class prototypes, which are learned automatically, and conduct the recognition task with them. Second, a coupled dictionary learning framework is proposed to align the class structures between the visual space and the semantic space, where the discriminative property of the visual space and the extensive property of the semantic space are merged in an aligned space. Third, the semantic information of unseen classes is utilized for domain adaptation, which increases the expansibility of our model to the unseen classes. To demonstrate the effectiveness of the proposed approach, we perform experiments on four popular zero-shot recognition datasets, where excellent results are achieved.
2 Related Work
In this section, we review related works on zero-shot learning from three aspects: semantic information, visual-semantic embeddings, and zero-shot recognition.
2.1 Semantic Information
Semantic information plays an important role in zero-shot learning. It builds up the relations between seen and unseen classes, thus making zero-shot recognition possible. Currently, the most popular semantic information includes attributes [1, 3, 9, 14, 19] and wordvectors [2, 7, 22]. Attributes are general descriptions of objects that can be shared among different classes; for example, furry can be shared among different animals. Thus it is possible to learn such attributes from auxiliary classes and apply them to novel classes for recognition. Wordvectors are automatically extracted from large text corpora, where the distances between wordvectors reflect the relations between classes, so they are also capable of building up the relations between seen and unseen classes.
Since the knowledge that can be collected is limited, semantic information obtained for general purposes is usually not discriminative enough to separate classes in specific domains. To tackle this problem, we propose to utilize the discriminative information lying in the visual space to improve the semantic space.
2.2 Visual-Semantic Embeddings
Visual-semantic embedding is the key to zero-shot learning, and most existing ZSL approaches focus on learning more robust visual-semantic embeddings. In the early stage, [9, 19] propose to use attribute classifiers for the ZSL task. Such methods learn each attribute classifier independently, which is not applicable to large-scale datasets with many attributes. To tackle this problem, label embedding approaches emerge [1, 2], where all attributes of a class are considered as a whole and label embedding functions are learned to maximize the compatibility of images with the corresponding class semantics. To improve the performance of such embedding models, [35] proposes latent embedding models, where multiple linear embeddings are learned to approximate non-linear embeddings. Furthermore, [10, 22, 26, 30, 34, 38] exploit deep neural networks to learn more robust visual-semantic transformations.
While some works pay attention to learning more complicated embedding functions, others deal with the visual-semantic transformation problem from different perspectives. [23] forms the semantic information of unseen samples by a convex combination of seen-class semantics. [39, 40] utilize the class similarities and [14] proposes discriminative latent attributes to form a more effective embedding space. [4] synthesizes the unseen-class classifiers by sharing the structures between the semantic space and the visual space. [5, 20] predict the visual exemplars by learning embedding functions from the semantic space to the visual space. [3] exploits metric learning techniques, where relative distances are utilized, to improve the embedding models. [27] views the image classifier as a function of the corresponding class semantics and uses an additional regularizer to learn the embedding functions. [16] utilizes the auto-encoder framework to learn the visual-semantic embeddings. [8] uses low-rank constraints to learn semantic dictionaries and [37] proposes a matrix tri-factorization approach with manifold regularizations. To tackle the embedding domain shift problem, [11, 15] use transfer learning techniques to extend ZSL to transductive settings, where the unseen-class samples are also utilized in the training process.
Different from such existing approaches which learn image embeddings or synthesize image classifiers, we propose to learn the class prototypes by jointly aligning the class structures between the visual space and the semantic space.
2.3 Zero-Shot Recognition
The most widely used approaches for zero-shot recognition are probability models [19] and nearest neighbour classifiers [1, 14, 39]. To make use of the rich intrinsic structures on the semantic manifold, [12] proposes semantic manifold distance to recognize the unseen class samples and [4] directly synthesizes the image classifiers of unseen classes in the visual space by sharing the structures between the semantic space and the visual space. Considering more real conditions, [6] expands the traditional ZSL problem to the generalized ZSL problem, where the seen classes are also considered in the test procedure. Recently, [36] proposes more reasonable data splits for different datasets and evaluates the performance of different approaches under such experiment settings.
3 Approaches
The general idea of the proposed approach is to learn the class prototypes by sharing the structures between the visual space and the semantic space. However, the structures of these two spaces may be inconsistent, since the semantic information is obtained independently of the visual examples. To tackle this problem, we propose a coupled dictionary learning (CDL) framework to simultaneously align the visual-semantic structures, so that the discriminative information in the visual space and the relations in the semantic space can be shared to benefit each other. Figure 2 shows the framework of our approach, which has three key submodules: prototype learning, structure alignment, and domain adaptation.
3.1 Problem Formulation
Assume a labeled training dataset contains K seen classes with \(n_s\) labeled samples \( \mathcal {S} = \{(x_i, y_i)| x_i \in \mathcal {X}, y_i \in \mathcal {Y}^{s}\}_{i=1}^{n_s}\), where \(x_i \in \mathbb {R}^d \) represents the image feature and \(y_i\) denotes the class label in \(\mathcal {Y}^{s}=\{s_1,...,s_K\}\). In addition, a disjoint class label set \(\mathcal {Y}^{u}=\{u_1,...,u_L\}\), which consists of L unseen classes, is provided, i.e. \(\mathcal {Y}^{s} \bigcap \mathcal {Y}^{u} = \emptyset \), but the corresponding images are missing. Given the class semantics \(\mathcal {C} = \{\mathcal {C}^s \bigcup \mathcal {C}^u\}\), the goal of ZSL is to learn image classifiers \(f_{zsl} : \mathcal {X} \rightarrow \mathcal {Y}^{u}\).
3.2 Framework
As is shown in Fig. 2, our framework contains three submodules: prototype learning, structure alignment and domain adaptation.
Prototype Learning. The structure alignment proposed in our framework is performed on the class prototypes. In order to align the class structures between the visual space and the semantic space, we must first obtain the class prototypes in both spaces. In the semantic space, we denote the class prototypes of the seen/unseen classes as \(C_s \in \mathbb {R}^{m \times K}\)/\(C_u \in \mathbb {R}^{m \times L}\), where m is the dimension of the semantic space. Here, \(C_s\)/\(C_u\) can be directly set as \(\mathcal {C}^s\)/\(\mathcal {C}^u\). However, in the visual space, only the seen-class samples \(X_s \in \mathbb {R}^{d \times n_s}\) and their corresponding labels \(Y_s\) are provided, so we must first learn the class prototypes \(P_s \in \mathbb {R}^{d \times K}\) in the visual space, where d is the dimension of the visual space. The basic idea for prototype learning is that samples should locate near their corresponding class prototypes in the visual space, so the loss function can be formulated as:
$$\begin{aligned} {\begin{matrix} \arg \min \limits _{P_s} \left\| X_s - P_sH \right\| _{F}^{2} \end{matrix}} \end{aligned}$$(1)
where each column in \(H \in \mathbb {R}^{K \times n_s}\) is a one-hot vector indicating the class label of the corresponding image.
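Since \(H\) is a one-hot indicator matrix, the least-squares minimizer of this term is simply the per-class mean of the image features. The following NumPy sketch makes this step concrete (for illustration only; the released code may differ):

```python
import numpy as np

def learn_seen_prototypes(X_s, y_s, K):
    """Minimize ||X_s - P_s H||_F^2 where H is the one-hot label matrix.

    With one-hot H, the least-squares solution P_s = X_s H^T (H H^T)^{-1}
    reduces to the mean feature vector of each seen class.
    X_s: d x n_s feature matrix; y_s: length-n_s integer labels in {0,...,K-1}.
    """
    d, n_s = X_s.shape
    H = np.zeros((K, n_s))
    H[y_s, np.arange(n_s)] = 1.0              # one-hot indicator matrix
    P_s = X_s @ H.T @ np.linalg.inv(H @ H.T)  # equals the per-class means
    return P_s, H
```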
Structure Alignment. Since the semantic information of classes is defined or extracted independently of the images, directly sharing the structure of the semantic space to form the prototypes of unseen classes in the visual space is not a good choice; structure alignment should be performed first. Therefore, we propose a coupled dictionary learning framework to align the visual-semantic structures. The basic idea of our structure alignment approach is to find a set of bases in each space to represent each class and enforce the new representations to be the same in the two spaces, so that the structures are aligned. The loss function is formulated as:
$$\begin{aligned} {\begin{matrix} \arg \min \limits _{D_1, D_2, Z_s} \left\| P_s - D_1Z_s \right\| _{F}^{2} + \lambda \left\| C_s - D_2Z_s \right\| _{F}^{2} \quad s.t. \quad ||\varvec{d}_1^i||_2^2 \le 1, \; ||\varvec{d}_2^i||_2^2 \le 1, \; \forall i \end{matrix}} \end{aligned}$$(2)
where \(P_s\) and \(C_s\) are the prototypes of the seen classes in the visual and semantic space respectively. \(D_1 \in \mathbb {R}^{d \times n_b}\) and \(D_2 \in \mathbb {R}^{m \times n_b}\) are the bases in the corresponding spaces, where d, m are the dimensions of the visual space and the semantic space respectively and \(n_b\) is the number of bases. \(Z_s \in \mathbb {R}^{n_b \times K}\) is the common new representation of the seen classes, and it plays the key role in aligning the two spaces. \(\lambda \) is a parameter controlling the relative importance of the visual space and the semantic space. \(\varvec{d}_1^i\) denotes the i-th column of \(D_1\) and \(\varvec{d}_2^i\) the i-th column of \(D_2\). By exploring new representation bases in each space to reformulate each class, we obtain the same class representations in the visual and semantic spaces, so the class structures in the two spaces become consistent.
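Because the shared representation appears in both reconstruction terms, minimizing the objective over \(Z_s\) with the dictionaries fixed has a simple ridge-style closed form, sketched below as an illustration of the coupling (not necessarily the exact solver used):

```python
import numpy as np

def update_shared_code(P, C, D1, D2, lam):
    """Solve min_Z ||P - D1 Z||_F^2 + lam * ||C - D2 Z||_F^2 for the shared
    representation Z that couples the visual and semantic spaces.

    Setting the gradient to zero gives
    (D1^T D1 + lam * D2^T D2) Z = D1^T P + lam * D2^T C.
    """
    A = D1.T @ D1 + lam * (D2.T @ D2)   # n_b x n_b system matrix
    B = D1.T @ P + lam * (D2.T @ C)     # n_b x K right-hand side
    return np.linalg.solve(A, B)
```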
Domain Adaptation. In the structure alignment process, only the seen-class prototypes are utilized, which may cause the domain shift problem [11]. In other words, a structure alignment learned on the seen classes may not be appropriate for the unseen classes, since there are some differences between seen and unseen classes. To tackle this problem, we further propose a domain adaptation term, which automatically learns the unseen-class prototypes in the visual space and uses them to assist the structure learning process. The loss function can be formulated as:
$$\begin{aligned} {\begin{matrix} \arg \min \limits _{P_u, Z_u} \left\| P_u - D_1Z_u \right\| _{F}^{2} + \lambda \left\| C_u - D_2Z_u \right\| _{F}^{2} \end{matrix}} \end{aligned}$$(3)
where \(P_u \in \mathbb {R}^{d \times L}\) and \(C_u \in \mathbb {R}^{m \times L}\) are the prototypes of the unseen classes in the visual and semantic space respectively, and \(Z_u \in \mathbb {R}^{n_b \times L}\) is the common new representation of the unseen classes.
As a whole, our full objective can be formulated as:
$$\begin{aligned} {\begin{matrix} \arg \min \limits _{P_s, P_u, D_1, D_2, Z_s, Z_u} \left\| P_s - D_1Z_s \right\| _{F}^{2} + \lambda \left\| C_s - D_2Z_s \right\| _{F}^{2} + \alpha \left( \left\| P_u - D_1Z_u \right\| _{F}^{2} + \lambda \left\| C_u - D_2Z_u \right\| _{F}^{2} \right) + \beta \left\| X_s - P_sH \right\| _{F}^{2} \\ s.t. \quad ||\varvec{d}_1^i||_2^2 \le 1, \; ||\varvec{d}_2^i||_2^2 \le 1, \; \forall i \end{matrix}} \end{aligned}$$(4)
where \(\alpha \) and \(\beta \) are parameters controlling the relative importance of the terms.
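For clarity, the following sketch simply evaluates this full objective (ignoring the norm constraints on the dictionary columns, which only affect the optimization):

```python
import numpy as np

def full_objective(P_s, P_u, D1, D2, Z_s, Z_u, C_s, C_u, X_s, H,
                   lam, alpha, beta):
    """Evaluate the full CDL objective: structure alignment on seen classes,
    the domain adaptation term on unseen classes (weighted by alpha), and
    the prototype-learning term (weighted by beta)."""
    sq = lambda M: np.sum(M ** 2)                       # squared Frobenius norm
    align = sq(P_s - D1 @ Z_s) + lam * sq(C_s - D2 @ Z_s)
    adapt = sq(P_u - D1 @ Z_u) + lam * sq(C_u - D2 @ Z_u)
    proto = sq(X_s - P_s @ H)
    return align + alpha * adapt + beta * proto
```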
3.3 Optimization
The final loss function of the proposed framework can be formulated as:
Eq. 5 is not jointly convex in \(P_s, P_u, D_1, D_2, Z_s\) and \(Z_u\), but it is convex in each of them when the others are fixed. We therefore employ an alternating optimization method to solve the problem.
Initialization. In our framework, we set the number of dictionary bases \(n_b\) to the number of seen classes K and enforce each column of Z to represent the similarities to all seen classes. First, we initialize \(Z_u \in \mathbb {R}^{K \times L}\) as the similarities of the unseen classes to the seen classes, i.e. the cosine similarities between the unseen- and seen-class prototypes in the semantic space. Second, we obtain \(D_2\) from the second term of Eq. 3, which has a closed-form solution. Third, we obtain \(Z_s\) from the second term of Eq. 2. Next, we initialize \(P_s\) as the mean of the samples in each class. Then, we obtain \(D_1\) from the first term of Eq. 2. Finally, we obtain \(P_u\) from the first term of Eq. 3. In this way, all the variables in our framework are initialized.
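As an illustration, the initialization of \(Z_u\) can be written as follows (a sketch assuming column-wise cosine similarity between the semantic prototypes):

```python
import numpy as np

def init_unseen_code(C_u, C_s):
    """Initialize Z_u (K x L) as the cosine similarities of each unseen-class
    semantic prototype (columns of C_u, m x L) to all seen-class prototypes
    (columns of C_s, m x K), as described in the initialization above."""
    Cs_n = C_s / np.linalg.norm(C_s, axis=0, keepdims=True)
    Cu_n = C_u / np.linalg.norm(C_u, axis=0, keepdims=True)
    return Cs_n.T @ Cu_n   # entry (k, l) = cos(C_s[:, k], C_u[:, l])
```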
Joint Optimization. After all variables in our framework are initialized separately, we jointly optimize them as follows:
(1) Fix \(D_1,Z_s\) and update \(P_s\). The subproblem can be formulated as:
$$\begin{aligned} {\begin{matrix} \arg \min \limits _{P_s} \left\| P_s - D_1Z_s \right\| _{F}^{2} + \beta \left\| X_s - P_sH \right\| _{F}^{2} \end{matrix}} \end{aligned}$$(6)
(2) Fix \(P_s,D_1,D_2\) and update \(Z_s\) by Eq. 2.
(3) Fix \(P_s,P_u,Z_s,Z_u\) and update \(D_1\). The subproblem can be formulated as:
$$\begin{aligned} {\begin{matrix} \arg \min \limits _{D_1} \left\| P_s - D_1Z_s \right\| _{F}^{2} + \alpha \left\| P_u - D_1Z_u \right\| _{F}^{2} \quad s.t. \quad ||\varvec{d}_1^i||_2^2 \le 1, \forall i. \end{matrix}} \end{aligned}$$(7)
(4) Fix \(Z_s,Z_u\) and update \(D_2\). The subproblem can be formulated as:
$$\begin{aligned} {\begin{matrix} \arg \min \limits _{D_2} \left\| C_s - D_2Z_s \right\| _{F}^{2} + \alpha \left\| C_u - D_2Z_u \right\| _{F}^{2} \quad s.t. \quad ||\varvec{d}_2^i||_2^2 \le 1, \forall i. \end{matrix}} \end{aligned}$$(8)
(5) Fix \(P_u,D_1,D_2\) and update \(Z_u\) by Eq. 3.
(6) Fix \(D_1,Z_u\) and update \(P_u\) by the first term of Eq. 3.
In our experiments, we set the maximum number of iterations to 100, and the optimization always converges within tens of iterations, usually fewer than 50 (Footnote 1).
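For concreteness, the following NumPy sketch implements one possible version of these six updates, solving each unconstrained subproblem in closed form and enforcing the unit-norm constraint on dictionary columns by simple column rescaling; this is a simplification of the constrained updates in Eqs. 7 and 8, and the released implementation may solve them differently:

```python
import numpy as np

def cdl_optimize(X_s, H, C_s, C_u, P_s, P_u, D1, D2, Z_s, Z_u,
                 lam, alpha, beta, n_iters=100):
    """Alternating optimization sketch for steps (1)-(6), assuming all
    variables are already initialized as described above."""
    K = H.shape[0]
    I_K = np.eye(K)

    def clip_columns(D):
        # project dictionary columns onto the unit l2 ball
        norms = np.maximum(np.linalg.norm(D, axis=0, keepdims=True), 1.0)
        return D / norms

    def solve_Z(P, C):
        # min_Z ||P - D1 Z||^2 + lam ||C - D2 Z||^2  (closed form)
        A = D1.T @ D1 + lam * (D2.T @ D2)
        return np.linalg.solve(A, D1.T @ P + lam * (D2.T @ C))

    for _ in range(n_iters):
        # (1) seen prototypes: min ||P_s - D1 Z_s||^2 + beta ||X_s - P_s H||^2
        P_s = (D1 @ Z_s + beta * (X_s @ H.T)) @ np.linalg.inv(I_K + beta * (H @ H.T))
        # (2) shared seen-class representation, by Eq. 2
        Z_s = solve_Z(P_s, C_s)
        # (3) visual dictionary (Eq. 7, constraint relaxed to column rescaling)
        G = np.linalg.inv(Z_s @ Z_s.T + alpha * (Z_u @ Z_u.T))
        D1 = clip_columns((P_s @ Z_s.T + alpha * (P_u @ Z_u.T)) @ G)
        # (4) semantic dictionary (Eq. 8, constraint relaxed to column rescaling)
        D2 = clip_columns((C_s @ Z_s.T + alpha * (C_u @ Z_u.T)) @ G)
        # (5) shared unseen-class representation, by Eq. 3
        Z_u = solve_Z(P_u, C_u)
        # (6) unseen prototypes, from the first term of Eq. 3
        P_u = D1 @ Z_u
    return P_s, P_u, D1, D2, Z_s, Z_u
```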
3.4 Zero-Shot Recognition
In the proposed framework, we obtain the prototypes of the unseen classes in different spaces (i.e. the visual space \(P_u\), the aligned space \(Z_u\), and the semantic space \(C_u\)), and in each of them we can perform the zero-shot recognition task with a nearest neighbour approach.
Recognition in the Visual Space. In the test process, we directly compute the similarities \(Sim_v\) of the test samples (\(X_i\)) to the unseen-class prototypes (\(P_u\)), e.g. by cosine similarity, and classify each image to the class of its most similar prototype.
Recognition in the Aligned Space. To perform the recognition task in this space, we first encode the test images \(X_i\) with the visual dictionary \(D_1\) to obtain their representations \(Z_i\) in the aligned space. Then we compute the similarities \(Sim_a\) of the test samples (\(Z_i\)) to the unseen-class prototypes (\(Z_u\)) and apply the same recognition approach as in the visual space.
Recognition in the Semantic Space. First, we should get the semantic representations of images by \(C_i = D_2Z_i\). Then the similarities \(Sim_s\) can be obtained by computing the distances between the test samples (\(C_i\)) and the unseen-class prototypes (\(C_u\)). The recognition task can be performed the same way as that in the visual space.
Combining Multiple Spaces. Since the visual space is more discriminative, the semantic space is more extensive, and the aligned space is a compromise between the two, combining multiple spaces can improve the performance. In our framework, we simply sum the similarities obtained in each space, e.g. combining the visual and aligned spaces by \(Sim_{va} = Sim_v + Sim_a\), and use the same nearest neighbour approach to perform the recognition task.
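A minimal sketch of the test-time procedure is given below; it assumes that test images are encoded into the aligned space by a least-squares fit to the visual dictionary \(D_1\), which is a simplification of the encoding step above:

```python
import numpy as np

def cosine_sim(A, B):
    """Column-wise cosine similarity: A is d x n, B is d x m -> n x m matrix."""
    A_n = A / np.linalg.norm(A, axis=0, keepdims=True)
    B_n = B / np.linalg.norm(B, axis=0, keepdims=True)
    return A_n.T @ B_n

def zero_shot_predict(X_test, P_u, Z_u, C_u, D1, D2, combine=("v", "a")):
    """Nearest-neighbour zero-shot recognition in the visual, aligned and
    semantic spaces, with a simple sum of similarities for combination."""
    # encode test images into the aligned space (least-squares fit to D1;
    # an assumption -- the exact encoder may include extra regularization)
    Z_test, *_ = np.linalg.lstsq(D1, X_test, rcond=None)
    C_test = D2 @ Z_test                       # semantic representations
    sims = {
        "v": cosine_sim(X_test, P_u),          # visual space
        "a": cosine_sim(Z_test, Z_u),          # aligned space
        "s": cosine_sim(C_test, C_u),          # semantic space
    }
    combined = sum(sims[k] for k in combine)   # e.g. Sim_va = Sim_v + Sim_a
    return np.argmax(combined, axis=1)         # predicted unseen-class index
```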
3.5 Difference from Relevant Works
Among prior works, the most relevant to ours is [4], which also utilizes the structures of the semantic space and the visual space. However, the key ideas of the two works are quite different. [4] uses fixed semantic information and directly shares its structure with the visual space to form unseen classifiers; it does not consider whether the two spaces are consistent, even though the semantic information is obtained independently of the visual exemplars. In contrast, our approach focuses on aligning the visual-semantic structures and then shares the aligned structures to form unseen-class prototypes in different spaces. Moreover, [4] learns visual classifiers independently of the semantic information, while our approach automatically learns the class prototypes in the visual space by jointly leveraging the semantic information. Furthermore, to make the model more suitable for the unseen classes and tackle the challenging domain shift problem, which is not addressed in [4], we propose to utilize the unseen-class semantics for domain adaptation. Another work [34] also uses structure constraints to learn visual-semantic embeddings. However, it deals with the sample structure, where the distances among samples are preserved, whereas our approach aligns the class structures, which aims to learn more robust class prototypes.
4 Experiments
4.1 Datasets and Settings
Datasets. Following the new data splits proposed by [36], we perform experiments on four benchmark ZSL datasets, i.e. aPascal & aYahoo (aPY) [9], Animals with Attributes (AwA) [19], Caltech-UCSD Birds-200-2011 (CUB) [32], and SUN Attribute (SUNA) [25], to verify the effectiveness of the proposed framework. The statistics of all datasets are shown in Table 1.
Settings. To make fair comparisons, we use the class semantics and image features provided by [36]. Specifically, the attribute vectors are used as the class semantics and the image features are extracted by the 101-layer ResNet [13]. The parameters (\(\lambda , \alpha , \beta , \gamma \)) of the proposed framework are tuned over the grid \(\{0.001, 0.01, 0.1, 1, 10\}\) using the train and validation splits provided by [36]. More details about the parameters can be found in the supplementary material. We use the average per-class top-1 accuracy to measure the performance of our models.
4.2 Evaluations of Different Spaces
The proposed framework involves three spaces, i.e. the visual space (v), the aligned space (a) and the semantic space (s). As described above, zero-shot recognition can be performed in each space independently or in a combined space, and the recognition results are shown in Fig. 3. The performance in the visual space is higher than that in the semantic space, which indicates that the incomplete semantic information is usually less discriminative. By aligning the visual-semantic structures, the discriminative property of the semantic space improves considerably, as can be inferred from the comparison between the aligned space and the semantic space. Moreover, the recognition performance is further improved by combining the visual space and the aligned space, since the visual space is more discriminative and the aligned space is more extensive. For AwA, the best performance is obtained in the visual space alone; perhaps the visual space is already discriminative enough and not complementary to the other spaces, so combining it with them pulls down its performance.
4.3 Comparison with State-of-the-Art
To demonstrate the effectiveness of the proposed framework, we compare our method with several popular approaches, and the recognition results on the four datasets are shown in Table 2. We report our results in the best space for each dataset, as analyzed in Sect. 4.2. Our framework achieves the best performance on three datasets and is comparable to the best approach on CUB, which indicates its effectiveness. SAE [16] performs poorly on aPY, probably because it is not robust to the weak relations between seen and unseen classes. We attribute the success of CDL to the structure alignment procedure. Different from other approaches, where fixed semantic information is utilized to perform the recognition task, we automatically adjust the semantic space by aligning the visual-semantic structures. Since the visual space is more discriminative and the semantic space is more extensive, the two spaces benefit each other when their structures are aligned. Compared with [4], we get a slightly lower result on CUB, which may be caused by the less discriminative class structures: CUB is a fine-grained dataset where most classes are very similar, so the class relations obtained in the visual space are less discriminative, whereas [4] learns more complicated image classifiers to enhance the discriminative property in the visual space.
4.4 Effectiveness of the Proposed Framework
In order to demonstrate the effectiveness of each component of our framework, we compare our approach with different submodels. The recognition task is performed in the best space for each dataset: for CUB, SUNA and aPY, we evaluate the performance by combining the visual space and the aligned space; for AwA, we evaluate the performance in the visual space. Figure 4 shows the zero-shot recognition results of the different submodels. By comparing the performance of “NA” and “CDL”, we can see that the models improve considerably when the visual-semantic structures are aligned, and that the less discriminative semantic space is improved with the help of the discriminative visual space. However, if the seen-class prototypes are fixed, it becomes difficult to align the structures of the two spaces and the models degrade seriously, as can be seen from the comparison of “CDL” and “CDL-Pr”. Moreover, the models become more suitable for the unseen classes when the unseen-class semantic information is utilized to adapt the learning procedure, as indicated by the comparison of “CDL” and “CDL-Ad”.
4.5 Visualization of the Class Structures
In order to gain an intuitive understanding of structure alignment, we visualize the class prototypes in the visual space and the semantic space on aPY, since the classes in aPY are easier to understand. In the visual space, we obtain each class prototype as the mean feature vector of all samples belonging to that class. In the semantic space, we take the class prototypes directly from the semantic representations. We then use the multidimensional scaling (MDS) approach [18] to visualize the class prototypes, which preserves the relations among all classes. The original class structures in the semantic space and the visual space are shown in the first row of Fig. 5. To make the figure more intuitive, we manually gather the classes into three groups, i.e. Vehicle, Animal and House. The class structures in the semantic space are not discriminative enough, as can be seen from the tight structures among the animals, while those in the visual space are more discriminative. Moreover, the structures of these two spaces are seriously inconsistent, so directly sharing the structures from the semantic space to the visual space to synthesize the unseen-class prototypes would degrade the model. Therefore, we propose to learn representation bases in each space to reformulate the class prototypes and align the class structures in a common space. The semantic structures become more discriminative after structure alignment. For example, in the original semantic space, dog and cat almost overlap, but they are separated after structure alignment with the help of their relations in the visual space. Thus the aligned semantic space becomes more discriminative for different classes. Moreover, the aligned structures of the two spaces become more consistent than those of the original spaces.
4.6 Visualization of Class Prototypes
The prototype of a class should locate near the samples belonging to that class. In order to check whether the prototypes are properly learned, we visualize the prototypes and the corresponding samples in the visual space. For a more intuitive understanding, we choose 10 seen classes and 5 unseen classes from AwA. We then use t-SNE [21] to project the visual samples and class prototypes onto a 2-D plane. The visualization results are shown in Fig. 6. Most prototypes locate near the samples of their own classes. Although the unseen prototypes deviate from the centers of the corresponding samples, since no corresponding images are provided for training, they are still discriminative enough to separate different classes, which shows the expansibility of our structure alignment approach for prototype learning. More visualization results can be seen in the supplementary material.
4.7 Generalized Zero-Shot Learning
To further demonstrate the effectiveness of the proposed framework, we also apply our method to the generalized zero-shot learning (GZSL) task, where the seen classes are also considered in the test procedure. The task of GZSL is to learn image classifiers \(f_{gzsl}: \mathcal {X} \rightarrow \mathcal {Y}^{s} \bigcup \mathcal {Y}^{u}\). We adopt the data splits provided by [36] and compare our method with several popular approaches. Table 3 shows the generalized zero-shot recognition results on the four datasets. Most approaches get low accuracy on the unseen-class samples because they overfit the seen classes, while our framework gets better results on the unseen classes and achieves a better balance between the seen and unseen classes. By jointly aligning the visual-semantic structures and utilizing the semantic information of unseen classes for adaptation, our model has less tendency to overfit the seen classes.
5 Conclusions
In this paper, we propose a coupled dictionary learning framework to align the visual-semantic structures for zero-shot learning, where the unseen-class prototypes are learned by sharing the aligned structures. Extensive experiments on four benchmark datasets show the effectiveness of the proposed approach. The success of CDL can be attributed to three characteristics. First, instead of using fixed semantic information to perform the recognition task, our structure alignment approach shares the discriminative property lying in the visual space and the extensive property lying in the semantic space, which allows the two spaces to benefit each other and improves the incomplete semantic space. Second, by utilizing the unseen-class semantics to adapt the learning procedure, our model becomes more suitable for the unseen classes. Third, the class prototypes are automatically learned by sharing the aligned structures, which makes it possible to directly perform the recognition task with a simple nearest neighbour approach. Moreover, we combine the information of multiple spaces to further improve the recognition performance.
Notes
1. Source code of CDL is available at http://vipl.ict.ac.cn/resources/codes.
References
Akata, Z., Perronnin, F., Harchaoui, Z., Schmid, C.: Label-embedding for attribute-based classification. In: Proceedings of Computer Vision and Pattern Recognition, pp. 819–826 (2013)
Akata, Z., Reed, S., Walter, D., Lee, H., Schiele, B.: Evaluation of output embeddings for fine-grained image classification. In: Proceedings of Computer Vision and Pattern Recognition, pp. 2927–2936 (2015)
Bucher, M., Herbin, S., Jurie, F.: Improving semantic embedding consistency by metric learning for zero-shot classiffication. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9909, pp. 730–746. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46454-1_44
Changpinyo, S., Chao, W.L., Gong, B., Sha, F.: Synthesized classifiers for zero-shot learning. In: Proceedings of Computer Vision and Pattern Recognition, pp. 5327–5336 (2016)
Changpinyo, S., Chao, W.L., Sha, F.: Predicting visual exemplars of unseen classes for zero-shot learning. In: Proceedings of International Conference on Computer Vision, pp. 3496–3505 (2017)
Chao, W.-L., Changpinyo, S., Gong, B., Sha, F.: An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9906, pp. 52–68. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46475-6_4
Demirel, B., Cinbis, R.G., Ikizler-Cinbis, N.: Attributes2Classname: a discriminative model for attribute-based unsupervised zero-shot learning. In: Proceedings of International Conference on Computer Vision, pp. 1241–1250 (2017)
Ding, Z., Shao, M., Fu, Y.: Low-rank embedded ensemble semantic dictionary for zero-shot learning. In: Proceedings of Computer Vision and Pattern Recognition, pp. 6005–6013 (2017)
Farhadi, A., Endres, I., Hoiem, D., Forsyth, D.: Describing objects by their attributes. In: Proceedings of Computer Vision and Pattern Recognition, pp. 1778–1785 (2009)
Frome, A., et al.: Devise: a deep visual-semantic embedding model. In: Proceedings of Advances in Neural Information Processing Systems, pp. 2121–2129 (2013)
Fu, Y., Hospedales, T.M., Xiang, T., Gong, S.: Transductive multi-view zero-shot learning. IEEE Trans. Pattern Anal. Mach. Intell. 37, 2332–2345 (2015)
Fu, Z.Y., Xiang, T.A., Kodirov, E., Gong, S.: Zero-shot object recognition by semantic manifold distance. In: Proceedings of Computer Vision and Pattern Recognition, pp. 2635–2644 (2015)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Jiang, H., Wang, R., Shan, S., Yang, Y., Chen, X.: Learning discriminative latent attributes for zero-shot classification. In: Proceedings of International Conference on Computer Vision, pp. 4233–4242 (2017)
Kodirov, E., Xiang, T., Fu, Z.Y., Gong, S.: Unsupervised domain adaptation for zero-shot learning. In: Proceedings of International Conference on Computer Vision, pp. 2452–2460 (2015)
Kodirov, E., Xiang, T., Gong, S.: Semantic autoencoder for zero-shot learning. In: Proceedings of Computer Vision and Pattern Recognition, pp. 4447–4456 (2017)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Proceedings of Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
Kruskal, J.B.: Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika 29(1), 1–27 (1964)
Lampert, C.H., Nickisch, H., Harmeling, S.: Learning to detect unseen object classes by between-class attribute transfer. In: Proceedings of Computer Vision and Pattern Recognition, pp. 951–958 (2009)
Long, Y., Liu, L., Shen, F., Shao, L., Li, X.: Zero-shot learning using synthesised unseen visual data with diffusion regularisation. IEEE Trans. Pattern Anal. Mach. Intell. 40(10), 2498–2512 (2018)
van der Maaten, L., Hinton, G.E.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008)
Morgado, P., Vasconcelos, N.: Semantically consistent regularization for zero-shot recognition. In: Proceedings of Computer Vision and Pattern Recognition, pp. 2037–2046 (2017)
Norouzi, M., et al.: Zero-shot learning by convex combination of semantic embeddings. In: Proceedings of International Conference on Learning Representations (2014)
Paredes, B.R., Torr, P.: An embarrassingly simple approach to zero-shot learning. In: Proceedings of International Conference on Machine Learning, pp. 2152–2161 (2015)
Patterson, G., Xu, C., Su, H., Hays, J.: The SUN attribute database: beyond categories for deeper scene understanding. Int. J. Comput. Vis. 108(1–2), 59–81 (2014)
Reed, S.E., Akata, Z., Schiele, B., Lee, H.: Learning deep representations of fine-grained visual descriptions. In: Proceedings of Computer Vision and Pattern Recognition, pp. 49–58 (2016)
Romera-Paredes, B., Torr, P.H.S.: An embarrassingly simple approach to zero-shot learning. In: Proceedings of International Conference on Machine Learning (2015)
Russakovsky, O., et al.: Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556 (2014)
Socher, R., Ganjoo, M., Sridhar, H., Bastani, O., Manning, C.D., Ng, A.Y.: Zero-shot learning through cross-modal transfer. In: Proceedings of Advances in Neural Information Processing Systems, pp. 935–943 (2013)
Szegedy, C., et al.: Going deeper with convolutions. In: Proceedings of Computer Vision and Pattern Recognition, pp. 1–9 (2015)
Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: The caltech-UCSD birds-200-2011 dataset. Technical report (2011)
Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.J.: The caltech-UCSD birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology (2011)
Wang, L., Li, Y., Lazebnik, S.: Learning deep structure-preserving image-text embeddings. In: Proceedings of Computer Vision and Pattern Recognition, pp. 5005–5013 (2016)
Xian, Y., Akata, Z., Sharma, G., Nguyen, Q.N., Hein, M., Schiele, B.: Latent embeddings for zero-shot classification. In: Proceedings of Computer Vision and Pattern Recognition, pp. 69–77 (2016)
Xian, Y., Schiele, B., Akata, Z.: Zero-shot learning - the good, the bad and the ugly. In: Proceedings of Computer Vision and Pattern Recognition (2017)
Xu, X., Shen, F., Yang, Y., Zhang, D., Shen, H.T., Song, J.: Matrix tri-factorization with manifold regularizations for zero-shot learning. In: Proceedings of Computer Vision and Pattern Recognition, pp. 2007–2016 (2017)
Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of Computer Vision and Pattern Recognition, pp. 3010–3019 (2017)
Zhang, Z., Saligrama, V.: Zero-shot learning via semantic similarity embedding. In: Proceedings of International Conference on Computer Vision, pp. 4166–4174 (2015)
Zhang, Z., Saligrama, V.: Zero-shot learning via joint latent similarity embedding. In: Proceedings of Computer Vision and Pattern Recognition, pp. 6034–6042 (2016)
Zhu, X., Anguelov, D., Ramanan, D.: Capturing long-tail distributions of object subcategories. In: Proceedings of Computer Vision and Pattern Recognition, pp. 915–922 (2014)
Acknowledgements
This work is partially supported by Natural Science Foundation of China under contracts Nos. 61390511, 61772500, 973 Program under contract No. 2015CB351802, Frontier Science Key Research Project CAS No. QYZDJ-SSW-JSC009, and Youth Innovation Promotion Association CAS No. 2015085.