
Dual-path Convolutional Image-Text Embeddings with Instance Loss

Published: 22 May 2020

Abstract

Matching images and sentences demands a fine-grained understanding of both modalities. In this article, we propose a new system that discriminatively embeds images and text into a shared visual-textual space. Most existing works in this field apply a ranking loss to pull positive image/text pairs close and push negative pairs apart. However, directly deploying the ranking loss on heterogeneous features (i.e., text and image features) is less effective, because appropriate triplets are hard to find at the beginning of training; naively using the ranking loss may therefore prevent the network from learning the inter-modal relationship. To address this problem, we propose the instance loss, which explicitly considers the intra-modal data distribution. It is based on the unsupervised assumption that each image/text group can be viewed as a distinct class, so the network can learn fine granularity from every image/text group. Experiments show that the instance loss offers better weight initialization for the ranking loss, so that more discriminative embeddings can be learned. In addition, existing works usually apply off-the-shelf features, i.e., word2vec and fixed visual features. As a minor contribution, this article constructs an end-to-end dual-path convolutional network to learn the image and text representations; end-to-end learning allows the system to directly learn from the data and fully utilize the supervision. On two generic retrieval datasets (Flickr30k and MSCOCO), experiments demonstrate that our method yields competitive accuracy compared with state-of-the-art methods. Moreover, in language-based person retrieval, we improve the state of the art by a large margin. The code has been made publicly available.
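The two losses described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' released code: it assumes a shared linear classifier `W` over instance classes for the instance loss, and a cosine-similarity hinge ranking loss with a hypothetical margin of 0.2 where the diagonal of the similarity matrix holds the positive pairs.

```python
import numpy as np

def instance_loss(features, labels, W):
    """Softmax cross-entropy that treats each image/text group as its own class.
    features: (N, D) embeddings from either modality (image path or text path)
    labels:   (N,)   instance (group) indices in [0, C)
    W:        (D, C) classifier weights shared by both modalities (assumption)
    """
    logits = features @ W
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def ranking_loss(img, txt, margin=0.2):
    """Bidirectional hinge ranking loss on cosine similarity.
    img, txt: (N, D) embeddings; row i of img and row i of txt form a positive pair.
    """
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    sim = img @ txt.T                 # (N, N) cosine similarities
    pos = np.diag(sim)                # positive-pair similarities
    cost_i2t = np.maximum(0.0, margin + sim - pos[:, None])  # image -> text
    cost_t2i = np.maximum(0.0, margin + sim - pos[None, :])  # text -> image
    np.fill_diagonal(cost_i2t, 0.0)   # do not penalize the positives themselves
    np.fill_diagonal(cost_t2i, 0.0)
    return cost_i2t.mean() + cost_t2i.mean()
```

Per the abstract's training strategy, the instance loss would be applied first to initialize the two paths, after which the ranking loss refines the shared embedding space.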


Published In

ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 16, Issue 2
May 2020
390 pages
ISSN:1551-6857
EISSN:1551-6865
DOI:10.1145/3401894

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 May 2020
Online AM: 07 May 2020
Accepted: 01 February 2020
Revised: 01 May 2019
Received: 01 August 2018
Published in TOMM Volume 16, Issue 2


Author Tags

  1. Image-sentence retrieval
  2. convolutional neural networks
  3. cross-modal retrieval
  4. language-based person search

Qualifiers

  • Research-article
  • Research
  • Refereed

Cited By

  • (2024) A Contrastive Learning Based Multiview Scene Matching Method for UAV View Geo-Localization. Remote Sensing 16(16), 3039. DOI: 10.3390/rs16163039. Online publication date: 19-Aug-2024
  • (2024) A Satellite-Drone Image Cross-View Geolocalization Method Based on Multi-Scale Information and Dual-Channel Attention Mechanism. Remote Sensing 16(6), 941. DOI: 10.3390/rs16060941. Online publication date: 7-Mar-2024
  • (2024) Image–Text Cross-Modal Retrieval with Instance Contrastive Embedding. Electronics 13(2), 300. DOI: 10.3390/electronics13020300. Online publication date: 9-Jan-2024
  • (2024) Learning hierarchical embedding space for image-text matching. Intelligent Data Analysis 28(3), 647–665. DOI: 10.3233/IDA-230214. Online publication date: 1-Jan-2024
  • (2024) CGKPN: Cross-Graph Knowledge Propagation Network with Adaptive Connection for Reasoning-Based Machine Reading Comprehension. ACM Transactions on Intelligent Systems and Technology 15(4), 1–24. DOI: 10.1145/3658673. Online publication date: 17-Apr-2024
  • (2024) Fine-grained Semantics-aware Representation Learning for Text-based Person Retrieval. Proceedings of the 2024 International Conference on Multimedia Retrieval, 92–100. DOI: 10.1145/3652583.3658054. Online publication date: 30-May-2024
  • (2024) DP-GCN: Node Classification by Connectivity and Local Topology Structure on Real-World Network. ACM Transactions on Knowledge Discovery from Data 18(6), 1–20. DOI: 10.1145/3649460. Online publication date: 12-Apr-2024
  • (2024) MACA: Memory-aided Coarse-to-fine Alignment for Text-based Person Search. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2497–2501. DOI: 10.1145/3626772.3657915. Online publication date: 10-Jul-2024
  • (2024) Parameter-Efficient Person Re-Identification in the 3D Space. IEEE Transactions on Neural Networks and Learning Systems 35(6), 7534–7547. DOI: 10.1109/TNNLS.2022.3214834. Online publication date: Jun-2024
  • (2024) Hierarchical Camera-Aware Contrast Extension for Unsupervised Person Re-Identification. IEEE Transactions on Multimedia 26, 7636–7648. DOI: 10.1109/TMM.2024.3369904. Online publication date: 26-Feb-2024
