Zero-Shot Image Classification Based on a Learnable Deep Metric
Abstract
1. Introduction
2. Related Work
2.1. Zero-Shot Learning
2.2. Meta Learning
2.3. Semantic Features
2.4. Similarity Measure for Zero-Shot Image Classification
3. Methodology
3.1. Task Definition
3.2. Model
3.2.1. Common Space Embedding Module
3.2.2. Relation Module
3.2.3. Objective Function
3.2.4. Model Implementation
Algorithm 1: Training process of ZIC-LDM
Input: Number of training iterations $T$, batch size $m$, learning rate $\lambda$, semantic features $A$, initialized FC parameters $W_a$ for semantic feature mapping, visual features $V$, FC parameters $W_v$ for visual feature mapping, and relation module parameters $W_r$.
Output: Optimized FC parameters $W_a$ for semantic feature mapping, FC parameters $W_v$ for visual feature mapping, and relation module parameters $W_r$.
1 for $t = 1, \ldots, T$ do
2   for each mini-batch do
3     Sample $m$ training samples $x_i$ and the corresponding labels $y_i$ from the seen classes;
4     Map the semantic features into the common space: $z_a = f(A; W_a)$;
5     Map the visual features into the common space: $z_v = g(x; W_v)$;
6     Concatenate $z_v$ and $z_a$;
7     Calculate the similarity scores: $r_{i,c} = h([z_{v,i}, z_{a,c}]; W_r)$;
8     Calculate the MSE loss: $\mathcal{L} = \sum_{i=1}^{m} \sum_{c} \left(r_{i,c} - \mathbb{1}(y_i = c)\right)^2$;
9     Update the FC parameters for semantic feature mapping, the FC parameters for visual feature mapping, and the relation module: $(W_a, W_v, W_r) \leftarrow (W_a, W_v, W_r) - \lambda \nabla \mathcal{L}$;
10    end for
11  end for
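To make the training loop concrete, here is a minimal PyTorch sketch of one training step. The layer sizes, the two-layer relation module, and the Adam optimizer are illustrative assumptions, not the authors' exact configuration:

```python
import torch
import torch.nn as nn

class ZICLDM(nn.Module):
    """Sketch of ZIC-LDM: two FC mappings into a common space plus a
    learnable relation module that scores each visual-semantic pair."""
    def __init__(self, vis_dim=2048, sem_dim=85, common_dim=1024, hidden=400):
        super().__init__()
        self.vis_map = nn.Sequential(nn.Linear(vis_dim, common_dim), nn.ReLU())
        self.sem_map = nn.Sequential(nn.Linear(sem_dim, common_dim), nn.ReLU())
        # Relation module: concatenated pair -> similarity score in [0, 1].
        self.relation = nn.Sequential(
            nn.Linear(2 * common_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, vis_feats, class_attrs):
        zv = self.vis_map(vis_feats)        # (batch, common_dim)
        za = self.sem_map(class_attrs)      # (num_classes, common_dim)
        b, c = zv.size(0), za.size(0)
        # Pair every image embedding with every class embedding (step 6).
        pairs = torch.cat([zv.unsqueeze(1).expand(b, c, -1),
                           za.unsqueeze(0).expand(b, c, -1)], dim=-1)
        return self.relation(pairs).squeeze(-1)  # (batch, num_classes)

# Synthetic stand-ins for one step: 32 images with 2048-dim CNN features
# and 40 seen classes with 85-dim attribute vectors (assumed sizes).
vis_feats = torch.randn(32, 2048)
labels = torch.randint(0, 40, (32,))
seen_attrs = torch.randn(40, 85)

model = ZICLDM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

scores = model(vis_feats, seen_attrs)
targets = torch.zeros_like(scores)
targets[torch.arange(32), labels] = 1.0         # 1 for matched pairs, else 0
loss = nn.functional.mse_loss(scores, targets)  # step 8: MSE loss
optimizer.zero_grad()
loss.backward()
optimizer.step()                                # step 9: update all parameters
```

Because the similarity score is produced by a trained network rather than a fixed formula, the metric itself adapts to the data, which is the core idea of the learnable deep metric.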
3.3. Testing Process
3.3.1. Zero-Shot Image Classification
3.3.2. Generalized Zero-Shot Image Classification
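Continuing the sketch from Section 3.2.4, testing only changes which classes compete for the highest relation score. A hedged illustration, where `class_attrs`, `unseen_ids`, and `all_ids` are hypothetical tensors holding the per-class semantic features and the candidate class indices:

```python
import torch

@torch.no_grad()
def predict(model, vis_feats, class_attrs, candidate_ids):
    """Return, for each image, the candidate class with the highest
    relation score."""
    scores = model(vis_feats, class_attrs[candidate_ids])
    return candidate_ids[scores.argmax(dim=1)]

# Zero-shot: the search space contains unseen classes only.
#   preds = predict(model, test_feats, class_attrs, unseen_ids)
# Generalized zero-shot: seen and unseen classes compete jointly.
#   preds = predict(model, test_feats, class_attrs, all_ids)
```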
4. Experiments
4.1. Datasets and Settings
4.2. Traditional Zero-Shot Image Classification
- On the AwA1, AwA2, CUB, and SUN datasets, ZIC-LDM outperforms the baseline SJE by 4.0%, 5.8%, 2.9%, and 5.2%, respectively. Compared with the recent models Gaussian and SELAR, ZIC-LDM also achieves excellent results, which shows that the proposed model is effective for zero-shot image classification and that the learnable deep metric lets ZIC-LDM learn a good visual-semantic relationship.
- Compared with methods that use predefined fixed metrics, such as DAP, ConSE, ESZSL, ALE, and SynC, ZIC-LDM achieves the best results on the AwA1, AwA2, CUB, and SUN datasets. This again indicates that the learnable deep metric helps ZIC-LDM capture the visual-semantic relationship well.
- Compared with SAE, which is based on semantic space embedding, and with RN and CCSS, which are based on visual space embedding, ZIC-LDM, based on common space embedding, achieves the best results on the AwA1, AwA2, CUB, and SUN datasets. This shows that common space embedding can alleviate the semantic gap problem.
4.3. Generalized Zero-Shot Image Classification
- Compared with the baseline SJE, the average per-class top-1 accuracy of ZIC-LDM on unseen classes is improved by 21.4%, 23.9%, 16.8%, and 8.8%, respectively, on the four datasets. The harmonic mean of ZIC-LDM (computed as sketched after this list) is also superior to that of SJE, with improvements of 28.4%, 33.3%, 15.5%, and 7.8%, respectively. In addition, compared with other traditional methods, i.e., DAP, SynC, ESZSL, ALE, SAE, and Gaussian, ZIC-LDM obtains the best unseen-class accuracy and harmonic mean on the AwA1, AwA2, CUB, and SUN datasets. This indicates that ZIC-LDM is better at alleviating the prediction bias against unseen classes.
- Compared with RN, which is also based on a learnable deep metric, the unseen-class accuracy of ZIC-LDM is improved by 1.3%, 1.9%, and 2.2% on AwA1, AwA2, and CUB, and the harmonic mean is improved by 1.3%, 2.1%, and 2.1%, respectively. This indicates that common space embedding can relieve the semantic gap in generalized zero-shot learning.
- Compared with the recent methods MLSE, MIIR, and SELAR, ZIC-LDM achieves the best unseen-class accuracy and harmonic mean on the AwA2, CUB, and SUN datasets. This shows that combining a learnable deep metric with common space embedding is effective for the generalized zero-shot image classification task.
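For reference, U and S in these comparisons are the average per-class top-1 accuracies on unseen and seen test classes, and H is their harmonic mean, H = 2·U·S/(U + S), following the standard GZSL evaluation protocol of Xian et al. A small sketch of both computations:

```python
import numpy as np

def per_class_accuracy(y_true, y_pred):
    """Average of per-class top-1 accuracies (the GZSL convention)."""
    classes = np.unique(y_true)
    return np.mean([np.mean(y_pred[y_true == c] == c) for c in classes])

def harmonic_mean(u, s):
    """H = 2US/(U+S); it is high only when U and S are both high."""
    return 2 * u * s / (u + s) if (u + s) > 0 else 0.0

# e.g., ZIC-LDM on AwA1 in the GZSL results table: U = 32.7, S = 90.5
print(round(harmonic_mean(32.7, 90.5), 1))  # -> 48.0
```

The harmonic mean penalizes models that score well on seen classes but collapse on unseen ones, which is why it is the headline GZSL metric.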
4.4. Loss Convergence Analysis
4.5. Distance Metric Study
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
- Huang, G.; Liu, Z.; van der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 22–25 July 2017; pp. 4700–4708.
- Deng, J.; Guo, J.; Xue, N.; Zafeiriou, S. ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 4690–4699.
- Liu, C.; Zoph, B.; Neumann, M.; Shlens, J.; Hua, W.; Li, L.-J.; Fei-Fei, L.; Yuille, A.; Huang, J.; Murphy, K. Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 19–34.
- Frome, A.; Corrado, G.S.; Shlens, J.; Bengio, S.; Dean, J.; Ranzato, M.; Mikolov, T. DeViSE: A deep visual-semantic embedding model. In Proceedings of the Conference on Neural Information Processing Systems (NIPS), Lake Tahoe, NV, USA, 5–10 December 2013; pp. 2121–2129.
- Zhang, Z.; Saligrama, V. Zero-shot learning via semantic similarity embedding. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 14–17 December 2015; pp. 4166–4174.
- Romera-Paredes, B.; Torr, P.H.S. An embarrassingly simple approach to zero-shot learning. In Proceedings of the 32nd International Conference on Machine Learning (ICML), Lille, France, 6–11 July 2015; pp. 2152–2161.
- Kodirov, E.; Xiang, T.; Gong, S. Semantic autoencoder for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 22–25 July 2017; pp. 3174–3183.
- Zhang, L.; Xiang, T.; Gong, S. Learning a deep embedding model for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 22–25 July 2017; pp. 2021–2030.
- Akata, Z.; Reed, S.; Walter, D.; Lee, H.; Schiele, B. Evaluation of output embeddings for fine-grained image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 2927–2936.
- Ji, Z.; Cui, B.; Li, H.; Jiang, Y.G.; Xiang, T.; Hospedales, T.; Fu, Y. Deep ranking for image zero-shot multi-label classification. IEEE Trans. Image Process. 2020, 29, 6549–6560.
- Ji, Z.; Sun, Y.; Yu, Y.; Pang, Y.W.; Han, J.G. Attribute-guided network for cross-modal zero-shot hashing. IEEE Trans. Neural Netw. Learn. Syst. 2019, 31, 321–330.
- Sandouk, U.; Chen, K. Multi-label zero-shot learning via concept embedding. arXiv 2016, arXiv:1606.00282.
- Sung, F.; Yang, Y.; Zhang, L.; Xiang, T.; Torr, P.H.S.; Hospedales, T.M. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 1199–1208.
- Lampert, C.H.; Nickisch, H.; Harmeling, S. Learning to detect unseen object classes by between-class attribute transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, 20–25 June 2009; pp. 951–958.
- Ji, Z.; Yan, J.; Wang, Q.; Pang, Y.; Li, X. Triple discriminator generative adversarial network for zero-shot image classification. Sci. China Inf. Sci. 2021, 64, 1–14.
- Ji, Z.; Chen, K.; Wang, J.; Yu, Y.; Zhang, Z. Multi-modal generative adversarial network for zero-shot learning. Knowl. Based Syst. 2020, 197, 105847.
- Zhang, Z.; Li, Y.; Yang, J.; Li, Y.; Gao, M. Cross-layer autoencoder for zero-shot learning. IEEE Access 2019, 7, 167584–167592.
- Yu, H.; Lee, B. Zero-shot learning via simultaneous generating and learning. arXiv 2019, arXiv:1910.09446.
- Shen, Y.; Qin, J.; Huang, L.; Liu, L.; Zhu, F.; Shao, L. Invertible zero-shot recognition flows. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 614–631.
- Al Machot, F.; Elkobaisi, M.R.; Kyamakya, K. Zero-shot human activity recognition using non-visual sensors. Sensors 2020, 20, 825.
- Matsuki, M.; Lago, P.; Inoue, S. Characterizing word embeddings for zero-shot sensor-based human activity recognition. Sensors 2019, 19, 5043.
- Ohashi, H.; Al-Naser, M.; Ahmed, S.; Nakamura, K.; Sato, T.; Dengel, A. Attributes' importance for zero-shot pose-classification based on wearable sensors. Sensors 2018, 18, 2485.
- Chao, W.L.; Changpinyo, S.; Gong, B.; Sha, F. An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 8–16 October 2016; pp. 52–68.
- Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv 2017, arXiv:1703.03400.
- Santoro, A.; Bartunov, S.; Botvinick, M.; Wierstra, D.; Lillicrap, T. Meta-learning with memory-augmented neural networks. In Proceedings of the 33rd International Conference on Machine Learning (ICML), New York, NY, USA, 19–24 June 2016; pp. 1842–1850.
- Snell, J.; Swersky, K.; Zemel, R.S. Prototypical networks for few-shot learning. In Proceedings of the Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 4077–4087.
- Jayaraman, D.; Grauman, K. Zero-shot recognition with unreliable attributes. In Proceedings of the Conference on Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 8–13 December 2014; pp. 3464–3472.
- Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.; Dean, J. Distributed representations of words and phrases and their compositionality. In Proceedings of the Conference on Neural Information Processing Systems (NIPS), Lake Tahoe, NV, USA, 5–10 December 2013; pp. 3111–3119.
- Pennington, J.; Socher, R.; Manning, C.D. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543.
- Socher, R.; Ganjoo, M.; Sridhar, H.; Bastani, O.; Manning, C.D.; Ng, A.Y. Zero-shot learning through cross-modal transfer. arXiv 2013, arXiv:1301.3666.
- Xie, G.S.; Liu, L.; Jin, X.; Zhu, F.; Zhang, Z.; Qin, J.; Yao, Y.; Shao, L. Attentive region embedding network for zero-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 9384–9393.
- Reed, S.; Akata, Z.; Lee, H.; Schiele, B. Learning deep representations of fine-grained visual descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 49–58.
- Ba, J.L.; Swersky, K.; Fidler, S.; Salakhutdinov, R. Predicting deep zero-shot convolutional neural networks using textual descriptions. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 11–18 December 2015; pp. 4247–4255.
- Xian, Y.; Lampert, C.H.; Schiele, B.; Akata, Z. Zero-shot learning—A comprehensive evaluation of the good, the bad and the ugly. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 2251–2265.
- Wah, C.; Branson, S.; Welinder, P.; Perona, P.; Belongie, S. The Caltech-UCSD Birds-200-2011 Dataset; California Institute of Technology: Pasadena, CA, USA, 2011.
- Patterson, G.; Hays, J. SUN attribute database: Discovering, annotating, and recognizing scene attributes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012; pp. 2751–2758.
- Norouzi, M.; Mikolov, T.; Bengio, S.; Singer, Y.; Shlens, J.; Frome, A.; Corrado, G.S.; Dean, J. Zero-shot learning by convex combination of semantic embeddings. arXiv 2013, arXiv:1312.5650.
- Akata, Z.; Perronnin, F.; Harchaoui, Z.; Schmid, C. Label-embedding for image classification. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 1425–1438.
- Changpinyo, S.; Chao, W.; Gong, B.; Sha, F. Synthesized classifiers for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 5327–5336.
- Liu, J.; Li, X.; Yang, G. Cross-class sample synthesis for zero-shot learning. In Proceedings of the 29th British Machine Vision Conference (BMVC), Newcastle, UK, 3–6 September 2018; pp. 113–124.
- Zhang, H.; Koniusz, P. Zero-shot kernel learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7670–7679.
- Yang, S.Q.; Wang, K.; Herranz, L. Simple and effective localized attribute representations for zero-shot learning. arXiv 2020, arXiv:2006.05938.
- Le Cacheux, Y.; le Borgne, H.; Crucianu, M. Modeling inter and intra-class relations in the triplet loss for zero-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–3 November 2019; pp. 10333–10342.
- Ding, Z.; Liu, H. Marginalized latent semantic encoder for zero-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 6191–6199.
Top-1 accuracy (%) for traditional zero-shot image classification on the AwA1, AwA2, CUB, and SUN datasets.

Model | AwA1 | AwA2 | CUB | SUN |
---|---|---|---|---|
DAP [15] | 44.1 | 46.1 | 40.0 | 39.9 |
ConSE [38] | 45.6 | 44.5 | 34.3 | 38.8 |
ESZSL [7] | 58.2 | 58.6 | 53.9 | 54.5 |
ALE [39] | 59.9 | 62.5 | 54.9 | 58.1 |
SynC [40] | 54.0 | 46.6 | 55.6 | 56.3 |
SAE [8] | 53.0 | 54.1 | 33.3 | 40.3 |
CCSS [41] | 56.3 | 63.7 | 44.1 | 56.8 |
Gaussian [42] | 60.5 | 61.2 | 52.1 | 58.7 |
SELAR [43] | - | 66.7 | 56.4 | 57.8 |
RN [14] | 68.2 | 64.2 | 55.6 | - |
SJE [10] | 65.6 | 61.9 | 53.9 | 53.7 |
ZIC-LDM | 69.6 | 67.7 | 56.8 | 58.9 |
Results (%) for generalized zero-shot image classification: U is the average per-class top-1 accuracy on unseen classes, S is the average per-class top-1 accuracy on seen classes, and H is their harmonic mean.

Model | AwA1 U | AwA1 S | AwA1 H | AwA2 U | AwA2 S | AwA2 H | CUB U | CUB S | CUB H | SUN U | SUN S | SUN H |
---|---|---|---|---|---|---|---|---|---|---|---|---|
DAP [15] | 0.0 | 88.7 | 0.0 | 0.0 | 84.7 | 0.0 | 1.7 | 67.9 | 3.3 | 4.2 | 25.1 | 7.2 |
SynC [40] | 8.9 | 87.3 | 16.2 | 10.0 | 90.5 | 18.0 | 11.5 | 70.9 | 19.8 | 7.9 | 43.3 | 13.4 |
ESZSL [7] | 6.6 | 75.6 | 12.1 | 5.9 | 77.8 | 11.0 | 12.6 | 63.8 | 21.0 | 11.0 | 27.9 | 15.8 |
ALE [39] | 16.8 | 76.1 | 27.5 | 14.0 | 81.8 | 23.9 | 23.7 | 62.8 | 34.4 | 21.8 | 33.1 | 26.3 |
SAE [8] | 1.8 | 77.1 | 3.5 | 1.1 | 82.2 | 2.2 | 7.8 | 54.0 | 13.6 | 8.8 | 18.0 | 11.8 |
ConSE [38] | 0.4 | 88.6 | 0.8 | 0.5 | 90.6 | 1.0 | 1.6 | 72.2 | 3.1 | 6.8 | 39.9 | 11.6 |
Gaussian [42] | 6.1 | 81.3 | 11.4 | 7.3 | 79.1 | 13.3 | 17.5 | 59.9 | 27.1 | 18.2 | 33.2 | 23.5 |
MLSE [45] | - | - | - | 23.8 | 83.2 | 37.0 | 22.3 | 71.6 | 34.0 | 20.7 | 36.4 | 26.4 |
MIIR [44] | - | - | - | 17.6 | 87.0 | 28.9 | 30.4 | 65.8 | 41.2 | 22.0 | 34.1 | 26.7 |
SELAR [43] | - | - | - | 31.6 | 80.3 | 45.3 | 32.1 | 63.0 | 42.5 | 22.8 | 31.6 | 26.5 |
RN [14] | 31.4 | 91.3 | 46.7 | 30.0 | 93.4 | 45.3 | 38.1 | 61.1 | 47.0 | - | - | - |
SJE [10] | 11.3 | 74.6 | 19.6 | 8.0 | 73.9 | 14.4 | 23.5 | 59.2 | 33.6 | 14.7 | 30.5 | 19.8 |
ZIC-LDM | 32.7 | 90.5 | 48.0 | 31.9 | 92.5 | 47.4 | 40.3 | 62.9 | 49.1 | 23.5 | 33.9 | 27.6 |
Results (%) of the distance metric study on AwA1, AwA2, and CUB, comparing the fixed metrics ED (Euclidean distance), CS (cosine similarity), and MML against the learnable deep metric: acc is the traditional zero-shot top-1 accuracy; U, S, and H are as in the previous table.

Model | AwA1 acc | AwA1 U | AwA1 S | AwA1 H | AwA2 acc | AwA2 U | AwA2 S | AwA2 H | CUB acc | CUB U | CUB S | CUB H |
---|---|---|---|---|---|---|---|---|---|---|---|---|
ED | 55.2 | 5.4 | 68.3 | 10.0 | 55.8 | 5.7 | 69.5 | 10.5 | 42.7 | 8.2 | 53.1 | 14.2 |
CS | 55.4 | 5.9 | 68.6 | 10.9 | 55.7 | 5.1 | 70.2 | 9.5 | 42.9 | 8.5 | 53.5 | 14.7 |
MML | 56.7 | 6.3 | 70.4 | 11.6 | 56.7 | 6.1 | 73.7 | 11.3 | 16.8 | 10.5 | 54.1 | 17.6 |
ZIC-LDM | 69.6 | 32.7 | 90.5 | 48.0 | 67.7 | 31.9 | 92.5 | 47.4 | 56.8 | 40.3 | 62.9 | 49.1 |
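The fixed metrics in this study score images against classes directly in the common space, with no trainable comparator. A minimal sketch of the ED and CS scoring rules, assuming `zv` and `za` are the mapped visual and class embeddings from the earlier training sketch (MML is omitted here):

```python
import torch
import torch.nn.functional as F

def euclidean_scores(zv, za):
    """Negative Euclidean distances: a larger score means more similar."""
    return -torch.cdist(zv, za)                        # (batch, num_classes)

def cosine_scores(zv, za):
    """Cosine similarity between every image/class embedding pair."""
    return F.normalize(zv, dim=1) @ F.normalize(za, dim=1).T
```

Swapping either fixed rule for the learned relation module is what lifts, for example, the CUB harmonic mean from 14.2–17.6 to 49.1 in the table above.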