Abstract
Text-based person search aims to retrieve pedestrian images corresponding to a specific identity based on a textual description. Existing methods primarily focus on either the alignment of global features through well-designed loss functions or the alignment of local features via attention mechanisms. However, these approaches overlook the extraction of crucial local cues and incur high computational costs associated with cross-modality similarity scores. To address these limitations, we propose a novel Cross-Modal Semantic Alignment Learning approach (SAL), which effectively facilitates the learning of discriminative representations with efficient and accurate cross-modal alignment. Specifically, we devise a Token Clustering Learning module that excavates crucial clues by clustering visual and textual token features extracted from the backbone into fine-grained compact part prototypes, each of which is corresponding to a specific identity-related discriminative semantic. Furthermore, we introduce the optimal transport strategy to explicitly encourage the fine-grained semantic alignment of image-text part prototypes, achieving efficient and accurate cross-modal matching while largely reducing computational costs. Extensive experiments on two public datasets demonstrate the effectiveness and superiority of SAL for text-based person search.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Chen, D., et al.: Improving deep visual representation for person re-identification by global and local image-language association. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11220, pp. 56–73. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01270-0_4
Chen, X., et al.: Salience-guided cascaded suppression network for person re-identification. In: CVPR, pp. 3300–3310 (2020)
Chen, Y.-C., et al.: UNITER: UNiversal Image-TExt Representation learning. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12375, pp. 104–120. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58577-8_7
Chen, Y., Zhang, G., Lu, Y., Wang, Z., Zheng, Y.: TIPCB: a simple but effective part-based convolutional baseline for text-based person search. Neurocomputing 494, 171–181 (2022)
Chen, Y., Zhang, G., Zhang, H., Zheng, Y., Lin, W.: Multi-level part-aware feature disentangling for text-based person search. In: ICME, pp. 2801–2806 (2023)
Cuturi, M.: Sinkhorn distances: lightspeed computation of optimal transport. In: Advances in Neural Information Processing Systems, vol. 26 (2013)
Ding, Z., Ding, C., Shao, Z., Tao, D.: Semantically self-aligned network for text-to-image part-aware person re-identification. arXiv preprint arXiv:2107.12666 (2021)
Farenzena, M., Bazzani, L., Perina, A., Murino, V., Cristani, M.: Person re-identification by symmetry-driven accumulation of local features. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2360–2367 (2010)
Farooq, A., Awais, M., Kittler, J., Khalid, S.S.: AXM-Net: implicit cross-modal feature alignment for person re-identification. In: AAAI, pp. 4477–4485 (2022)
Gao, C., et al.: Contextual non-local alignment over full-scale representation for text-based person search. arXiv preprint arXiv:2101.03036 (2021)
Han, X., He, S., Zhang, L., Xiang, T.: Text-based person search with limited data. arXiv preprint arXiv:2110.10807 (2021)
He, T., Jin, X., Shen, X., Huang, J., Chen, Z., Hua, X.S.: Dense interaction learning for video-based person re-identification. In: ICCV, pp. 1490–1501 (2021)
Hermans, A., Beyer, L., Leibe, B.: In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737 (2017)
Hou, R., Ma, B., Chang, H., Gu, X., Shan, S., Chen, X.: Interaction-and-aggregation network for person re-identification. In: CVPR, pp. 9317–9326 (2019)
Jampani, V., Sun, D., Liu, M.-Y., Yang, M.-H., Kautz, J.: Superpixel sampling networks. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11211, pp. 363–380. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01234-2_22
Jing, Y., Si, C., Wang, J., Wang, W., Wang, L., Tan, T.: Pose-guided multi-granularity attention network for text-based person search. In: AAAI, pp. 11189–11196 (2020)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Lee, K.-H., Chen, X., Hua, G., Hu, H., He, X.: Stacked cross attention for image-text matching. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11208, pp. 212–228. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01225-0_13
Li, S., Lu, A., Huang, Y., Li, C., Wang, L.: Joint token and feature alignment framework for text-based person search. IEEE Signal Process. Lett. 29, 2238–2242 (2022)
Li, S., Xiao, T., Li, H., Zhou, B., Yue, D., Wang, X.: Person search with natural language description. In: CVPR, pp. 1970–1979 (2017)
Liu, J., Zha, Z.J., Hong, R., Wang, M., Zhang, Y.: Deep adversarial graph attention convolution network for text-based person search. In: ACM MM, pp. 665–673 (2019)
Niu, K., Huang, Y., Ouyang, W., Wang, L.: Improving description-based person re-identification by multi-granularity image-text alignments. IEEE Trans. Image Process. 29, 5542–5556 (2020)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: ICML, pp. 8748–8763 (2021)
Sennrich, R., Haddow, B., Birch, A.: Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909 (2015)
Shao, Z., Zhang, X., Fang, M., Lin, Z., Wang, J., Ding, C.: Learning granularity-unified representations for text-to-image person re-identification. In: ACM MM, pp. 5566–5574 (2022)
Shen, Y., Li, H., Yi, S., Chen, D., Wang, X.: Person re-identification with deep similarity-guided graph neural network. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11219, pp. 508–526. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01267-0_30
Shu, X., et al.: See finer, see more: implicit modality alignment for text-based person retrieval. In: Karlinsky, L., Michaeli, T., Nishino, K. (eds.) Computer Vision – ECCV 2022 Workshops. ECCV 2022. LNCS, vol. 13805, pp. 624–641. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-25072-9_42
Wang, C., Luo, Z., Lin, Y., Li, S.: Text-based person search via multi-granularity embedding learning. In: IJCAI, pp. 1068–1074 (2021)
Wang, Z., Fang, Z., Wang, J., Yang, Y.: ViTAA: visual-textual attributes alignment in person search by natural language. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12357, pp. 402–420. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58610-2_24
Wang, Z., et al.: CAIBC: capturing all-round information beyond color for text-based person retrieval. In: ACM MM, pp. 5314–5322 (2022)
Wang, Z., et al.: Look before you leap: improving text-based person retrieval by learning a consistent cross-modal common manifold. In: ACM MM, pp. 1984–1992 (2022)
Wei, L., Zhang, S., Gao, W., Tian, Q.: Person transfer GAN to bridge domain gap for person re-identification. In: CVPR, pp. 79–88 (2018)
Wu, B., Cheng, R., Zhang, P., Vajda, P., Gonzalez, J.E.: Data efficient language-supervised zero-shot recognition with optimal transport distillation. arXiv preprint arXiv:2112.09445 (2021)
Wu, Y., Yan, Z., Han, X., Li, G., Zou, C., Cui, S.: LapsCore: language-guided person search via color reasoning. In: ICCV, pp. 1624–1633 (2021)
Yan, S., Dong, N., Zhang, L., Tang, J.: Clip-driven fine-grained text-image person re-identification. arXiv preprint arXiv:2210.10276 (2022)
Zhang, Y., Liu, D., Zha, Z.J.: Improving triplet-wise training of convolutional neural network for vehicle re-identification. In: ICME, pp. 1386–1391 (2017)
Zhang, Y., Lu, H.: Deep cross-modal projection learning for image-text matching. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11205, pp. 707–723. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01246-5_42
Zheng, K., Liu, W., He, L., Mei, T., Luo, J., Zha, Z.J.: Group-aware label transfer for domain adaptive person re-identification. In: CVPR, pp. 5310–5319 (2021)
Zheng, K., Liu, W., Liu, J., Zha, Z.J., Mei, T.: Hierarchical Gumbel attention network for text-based person search. In: ACM MM, pp. 3441–3449 (2020)
Zheng, Z., Zheng, L., Garrett, M., Yang, Y., Xu, M., Shen, Y.D.: Dual-path convolutional image-text embeddings with instance loss. ACM Trans. Multimedia Comput. Commun. Appl. (TOMM) 16, 1–23 (2020)
Acknowledgements
This work was supported by National Natural Science Foundation of China (NSFC) under Grants 62106245, 62225207 and U19B2038.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Gan, W., Liu, J., Zhu, Y., Wu, Y., Zhao, G., Zha, ZJ. (2024). Cross-Modal Semantic Alignment Learning for Text-Based Person Search. In: Rudinac, S., et al. MultiMedia Modeling. MMM 2024. Lecture Notes in Computer Science, vol 14554. Springer, Cham. https://doi.org/10.1007/978-3-031-53305-1_16
Download citation
DOI: https://doi.org/10.1007/978-3-031-53305-1_16
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-53304-4
Online ISBN: 978-3-031-53305-1
eBook Packages: Computer ScienceComputer Science (R0)