Cross-Modal Semantic Alignment Learning for Text-Based Person Search

  • Conference paper
MultiMedia Modeling (MMM 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14554)

Abstract

Text-based person search aims to retrieve pedestrian images of a specific identity from a textual description. Existing methods focus primarily on either aligning global features through well-designed loss functions or aligning local features via attention mechanisms. However, these approaches overlook the extraction of crucial local cues and incur high computational costs when computing cross-modality similarity scores. To address these limitations, we propose a novel Cross-Modal Semantic Alignment Learning approach (SAL), which facilitates the learning of discriminative representations with efficient and accurate cross-modal alignment. Specifically, we devise a Token Clustering Learning module that mines crucial cues by clustering the visual and textual token features extracted from the backbone into fine-grained, compact part prototypes, each corresponding to a specific identity-related discriminative semantic. Furthermore, we introduce an optimal transport strategy that explicitly encourages fine-grained semantic alignment of image-text part prototypes, achieving efficient and accurate cross-modal matching while substantially reducing computational costs. Extensive experiments on two public datasets demonstrate the effectiveness and superiority of SAL for text-based person search.
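The abstract's optimal transport step can be illustrated with a minimal sketch. This preview does not give the paper's actual formulation, so everything below is an assumption made for illustration: the function names `sinkhorn_plan` and `ot_similarity`, uniform mass on each part prototype, a cosine-distance cost, the regularisation strength `eps`, and the use of standard Sinkhorn iterations as the solver.

```python
import numpy as np

def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropy-regularised transport plan between two uniform marginals.

    Assumption: each prototype carries equal mass; the paper may weight them differently.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)      # mass on image part prototypes
    b = np.full(m, 1.0 / m)      # mass on text part prototypes
    K = np.exp(-cost / eps)      # Gibbs kernel of the cost matrix
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)          # rescale rows towards marginal a
        v = b / (K.T @ u)        # rescale columns towards marginal b
    return u[:, None] * K * v[None, :]

def ot_similarity(img_protos, txt_protos, eps=0.1, n_iters=200):
    """Image-text score: cosine similarities weighted by the transport plan."""
    img = img_protos / np.linalg.norm(img_protos, axis=1, keepdims=True)
    txt = txt_protos / np.linalg.norm(txt_protos, axis=1, keepdims=True)
    sim = img @ txt.T                            # pairwise cosine similarity
    plan = sinkhorn_plan(1.0 - sim, eps, n_iters)  # cost = cosine distance
    return float((plan * sim).sum())

# Hypothetical data: 4 part prototypes of dimension 8 per modality.
rng = np.random.default_rng(0)
image_prototypes = rng.normal(size=(4, 8))
text_prototypes = rng.normal(size=(4, 8))
score = ot_similarity(image_prototypes, text_prototypes)
```

Because the plan is computed over a handful of prototypes rather than over every image patch and word token, the cost matrix stays small, which is consistent with the efficiency argument made in the abstract.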

Acknowledgements

This work was supported by National Natural Science Foundation of China (NSFC) under Grants 62106245, 62225207 and U19B2038.

Author information

Corresponding author

Correspondence to Jiawei Liu.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Gan, W., Liu, J., Zhu, Y., Wu, Y., Zhao, G., Zha, ZJ. (2024). Cross-Modal Semantic Alignment Learning for Text-Based Person Search. In: Rudinac, S., et al. MultiMedia Modeling. MMM 2024. Lecture Notes in Computer Science, vol 14554. Springer, Cham. https://doi.org/10.1007/978-3-031-53305-1_16

  • DOI: https://doi.org/10.1007/978-3-031-53305-1_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-53304-4

  • Online ISBN: 978-3-031-53305-1

  • eBook Packages: Computer Science, Computer Science (R0)
