Cross-Modal Semantic Alignment Learning for Text-Based Person Search

Gan, Wenjun; Liu, Jiawei; Zhu, Yangchun; Wu, Yong; Zhao, Guozhi; Zha, Zheng-Jun

doi:10.1007/978-3-031-53305-1_16


Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14554)

Included in the following conference series:

International Conference on Multimedia Modeling

Abstract

Text-based person search aims to retrieve pedestrian images of a specific identity given a textual description. Existing methods primarily focus on either aligning global features through well-designed loss functions or aligning local features via attention mechanisms. However, these approaches overlook the extraction of crucial local cues and incur high computational costs when computing cross-modality similarity scores. To address these limitations, we propose a novel Cross-Modal Semantic Alignment Learning approach (SAL), which facilitates the learning of discriminative representations with efficient and accurate cross-modal alignment. Specifically, we devise a Token Clustering Learning module that mines crucial cues by clustering the visual and textual token features extracted from the backbone into fine-grained, compact part prototypes, each of which corresponds to a specific identity-related discriminative semantic. Furthermore, we introduce an optimal transport strategy to explicitly encourage fine-grained semantic alignment of image-text part prototypes, achieving efficient and accurate cross-modal matching while substantially reducing computational costs. Extensive experiments on two public datasets demonstrate the effectiveness and superiority of SAL for text-based person search.
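
The abstract names two ingredients, clustering token features into part prototypes and aligning the prototypes with optimal transport, without giving their formulation. The following is a minimal NumPy sketch of how such a pipeline might look, not the authors' implementation. It assumes a soft k-means style clustering and entropy-regularised Sinkhorn iterations with uniform marginals; the function names (`cluster_tokens`, `sinkhorn`, `ot_similarity`) and all hyperparameters (`num_parts`, `temperature`, `eps`, iteration counts) are hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def cluster_tokens(tokens, num_parts, iters=10, temperature=0.1, seed=0):
    """Soft k-means over token features (an assumed stand-in for token clustering).

    tokens: (n_tokens, dim) array; returns (num_parts, dim) unit-norm prototypes.
    """
    rng = np.random.default_rng(seed)
    tokens = l2_normalize(tokens)
    # Initialise the prototypes from randomly chosen tokens.
    idx = rng.choice(tokens.shape[0], size=num_parts, replace=False)
    protos = tokens[idx].copy()
    for _ in range(iters):
        # Soft assignment of every token to every prototype (row-wise softmax).
        logits = tokens @ protos.T / temperature
        logits -= logits.max(axis=1, keepdims=True)
        assign = np.exp(logits)
        assign /= assign.sum(axis=1, keepdims=True)
        # Each prototype becomes the assignment-weighted mean of the tokens.
        protos = l2_normalize(assign.T @ tokens)
    return protos


def sinkhorn(cost, eps=0.1, iters=200):
    """Entropy-regularised optimal transport plan with uniform marginals."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # mass on each image prototype
    b = np.full(m, 1.0 / m)  # mass on each text prototype
    K = np.exp(-cost / eps)  # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(iters):
        # Alternately rescale rows and columns towards the target marginals.
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]


def ot_similarity(img_protos, txt_protos, eps=0.1):
    """Image-text score: cosine similarities weighted by the transport plan."""
    sim = img_protos @ txt_protos.T  # cosine similarity, inputs are unit-norm
    plan = sinkhorn(1.0 - sim, eps=eps)
    return float((plan * sim).sum()), plan
```

Under these assumptions, matching an image against a caption only needs a small prototype-by-prototype cost matrix (for example 4 x 4) rather than a token-by-token one, which is one plausible reading of the reduced matching cost claimed in the abstract.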

Acknowledgements

This work was supported by the National Natural Science Foundation of China (NSFC) under Grants 62106245, 62225207 and U19B2038.

Author information

Authors and Affiliations

University of Science and Technology of China, Hefei, 230027, China
Wenjun Gan, Jiawei Liu, Yangchun Zhu & Zheng-Jun Zha
China Merchants Bank, Shenzhen, 518040, China
Yong Wu & Guozhi Zhao

Corresponding author

Correspondence to Jiawei Liu.

Editor information

Editors and Affiliations

University of Amsterdam, Amsterdam, The Netherlands
Stevan Rudinac
Delft University of Technology, Delft, The Netherlands
Alan Hanjalic
Delft University of Technology, Delft, The Netherlands
Cynthia Liem
University of Amsterdam, Amsterdam, The Netherlands
Marcel Worring
Reykjavik University, Reykjavik, Iceland
Björn Þór Jónsson
Microsoft Research Lab – Asia, Beijing, China
Bei Liu
The University of Tokyo, Tokyo, Japan
Yoko Yamakata

About this paper

Cite this paper

Gan, W., Liu, J., Zhu, Y., Wu, Y., Zhao, G., Zha, ZJ. (2024). Cross-Modal Semantic Alignment Learning for Text-Based Person Search. In: Rudinac, S., et al. MultiMedia Modeling. MMM 2024. Lecture Notes in Computer Science, vol 14554. Springer, Cham. https://doi.org/10.1007/978-3-031-53305-1_16

DOI: https://doi.org/10.1007/978-3-031-53305-1_16
Published: 28 January 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-53304-4
Online ISBN: 978-3-031-53305-1
eBook Packages: Computer Science, Computer Science (R0)
