
Dual-path Convolutional Image-Text Embeddings with Instance Loss

Published: 22 May 2020

Abstract

Matching images and sentences demands a fine-grained understanding of both modalities. In this article, we propose a new system that discriminatively embeds images and text into a shared visual-textual space. Most existing works in this field apply a ranking loss to pull positive image/text pairs close and push negative pairs apart. However, directly deploying the ranking loss on heterogeneous features (i.e., text and image features) is less effective, because appropriate triplets are hard to find at the beginning of training; naively using the ranking loss may therefore prevent the network from learning the inter-modal relationship. To address this problem, we propose the instance loss, which explicitly considers the intra-modal data distribution. It is based on the unsupervised assumption that each image/text group can be viewed as a distinct class, so the network can learn fine granularity from every image/text group. Experiments show that the instance loss offers better weight initialization for the ranking loss, so that more discriminative embeddings can be learned. In addition, existing works usually apply off-the-shelf features, i.e., word2vec and fixed visual features. As a minor contribution, this article constructs an end-to-end dual-path convolutional network to learn the image and text representations; end-to-end learning allows the system to directly learn from the data and fully utilize the supervision. On two generic retrieval datasets (Flickr30k and MSCOCO), experiments demonstrate that our method yields competitive accuracy compared with state-of-the-art methods. Moreover, in language-based person retrieval, we improve the state of the art by a large margin. The code has been made publicly available.
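The two losses described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' released code: it assumes a shared linear classifier `W` over instance classes for the instance loss, and a cosine-similarity hinge ranking loss with a hypothetical margin of 0.2 where the diagonal of the similarity matrix holds the positive pairs.

```python
import numpy as np

def instance_loss(features, labels, W):
    """Softmax cross-entropy that treats each image/text group as its own class.
    features: (N, D) embeddings from either modality (image path or text path)
    labels:   (N,)   instance (group) indices in [0, C)
    W:        (D, C) classifier weights shared by both modalities (assumption)
    """
    logits = features @ W
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def ranking_loss(img, txt, margin=0.2):
    """Bidirectional hinge ranking loss on cosine similarity.
    img, txt: (N, D) embeddings; row i of img and row i of txt form a positive pair.
    """
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    sim = img @ txt.T                 # (N, N) cosine similarities
    pos = np.diag(sim)                # positive-pair similarities
    cost_i2t = np.maximum(0.0, margin + sim - pos[:, None])  # image -> text
    cost_t2i = np.maximum(0.0, margin + sim - pos[None, :])  # text -> image
    np.fill_diagonal(cost_i2t, 0.0)   # do not penalize the positives themselves
    np.fill_diagonal(cost_t2i, 0.0)
    return cost_i2t.mean() + cost_t2i.mean()
```

Per the abstract's training strategy, the instance loss would be applied first to initialize the two paths, after which the ranking loss refines the shared embedding space.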


Published In

ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 16, Issue 2
May 2020
390 pages
ISSN:1551-6857
EISSN:1551-6865
DOI:10.1145/3401894

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 May 2020
Online AM: 07 May 2020
Accepted: 01 February 2020
Revised: 01 May 2019
Received: 01 August 2018
Published in TOMM Volume 16, Issue 2


Author Tags

  1. Image-sentence retrieval
  2. convolutional neural networks
  3. cross-modal retrieval
  4. language-based person search

Qualifiers

  • Research-article
  • Research
  • Refereed

Cited By

  • (2024) A Contrastive Learning Based Multiview Scene Matching Method for UAV View Geo-Localization. Remote Sensing 16(16), 3039. DOI: 10.3390/rs16163039. Online publication date: 19-Aug-2024
  • (2024) A Satellite-Drone Image Cross-View Geolocalization Method Based on Multi-Scale Information and Dual-Channel Attention Mechanism. Remote Sensing 16(6), 941. DOI: 10.3390/rs16060941. Online publication date: 7-Mar-2024
  • (2024) Image–Text Cross-Modal Retrieval with Instance Contrastive Embedding. Electronics 13(2), 300. DOI: 10.3390/electronics13020300. Online publication date: 9-Jan-2024
  • (2024) Learning hierarchical embedding space for image-text matching. Intelligent Data Analysis 28(3), 647–665. DOI: 10.3233/IDA-230214. Online publication date: 1-Jan-2024
  • (2024) CGKPN: Cross-Graph Knowledge Propagation Network with Adaptive Connection for Reasoning-Based Machine Reading Comprehension. ACM Transactions on Intelligent Systems and Technology 15(4), 1–24. DOI: 10.1145/3658673. Online publication date: 17-Apr-2024
  • (2024) Fine-grained Semantics-aware Representation Learning for Text-based Person Retrieval. Proceedings of the 2024 International Conference on Multimedia Retrieval, 92–100. DOI: 10.1145/3652583.3658054. Online publication date: 30-May-2024
  • (2024) DP-GCN: Node Classification by Connectivity and Local Topology Structure on Real-World Network. ACM Transactions on Knowledge Discovery from Data 18(6), 1–20. DOI: 10.1145/3649460. Online publication date: 12-Apr-2024
  • (2024) MACA: Memory-aided Coarse-to-fine Alignment for Text-based Person Search. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2497–2501. DOI: 10.1145/3626772.3657915. Online publication date: 10-Jul-2024
  • (2024) Parameter-Efficient Person Re-Identification in the 3D Space. IEEE Transactions on Neural Networks and Learning Systems 35(6), 7534–7547. DOI: 10.1109/TNNLS.2022.3214834. Online publication date: Jun-2024
  • (2024) Hierarchical Camera-Aware Contrast Extension for Unsupervised Person Re-Identification. IEEE Transactions on Multimedia 26, 7636–7648. DOI: 10.1109/TMM.2024.3369904. Online publication date: 26-Feb-2024
