Attention Based CNN-ConvLSTM for Pedestrian Attribute Recognition
Abstract
1. Introduction
- Existing methods use a Recurrent Neural Network (RNN) or LSTM to mine the correlation between pedestrian attributes. The spatial information of the attributes is lost in this procedure, even though it is important for recognition performance. To better mine both the spatial and the semantic correlation between attributes, this paper adopts ConvLSTM, which retains spatial information by using convolution operations in its input-to-state and state-to-state transitions. An end-to-end trainable model is built by stacking several ConvLSTMs to extract spatiotemporal correlation information from the predicted pedestrian attribute sequence.
- A CNN is used as the visual feature extractor in most deep learning based pedestrian attribute recognition methods. Channel attention (CAtt) adaptively adjusts the weights of channel features according to the correlation of the input features, which helps both feature extraction and attribute recognition. However, none of the existing pedestrian attribute recognition methods use a CAtt mechanism. In this paper, the most relevant and salient visual features of pedestrian attributes are extracted and re-weighted using a CAtt mechanism. The CAtt is seamlessly integrated with ConvLSTM, since the CAtt weights used for feature re-weighting are calculated from both the visual features and the hidden states of the ConvLSTM. As far as we know, this is the first time that a CAtt mechanism has been used in pedestrian attribute recognition.
- For multi-label CNN-RNN methods, the order in which labels (attributes) are predicted is important. Most existing methods use a random sequence. In this paper, an optimized prediction sequence is proposed. Because attributes cover image regions of different sizes, they carry different amounts of information: global attributes (such as gender and age range) carry more information, whereas local attributes (such as hair and footwear) carry less. Attributes that carry more information are easier to recognize accurately. Following the intuition that easier attributes should be predicted first to help predict more difficult ones, an optimized prediction sequence that runs from global attributes to local attributes is put forward to further improve attribute recognition performance.
- Extensive experiments are carried out to analyze and verify this method. In-depth comparisons are conducted with seven other state-of-the-art (SOTA) models on two common pedestrian attribute benchmark datasets, PETA [16] and RAP [17]. Compared with these models, the CNN-CAtt-ConvLSTM model proposed in this paper yields superior performance.
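The global-to-local ordering described in the third contribution can be illustrated with a small sketch. The scope labels and the `SCOPE_RANK` table are hypothetical. Only the four example attributes and their global/local grouping come from the text, and the ordering actually used in the paper may be finer-grained.

```python
# Hypothetical ranks: a lower rank means a larger image region, predicted earlier.
SCOPE_RANK = {"global": 0, "local": 1}

# Example attributes named in the text, listed here in an arbitrary order.
ATTRIBUTES = [
    ("footwear", "local"),
    ("gender", "global"),
    ("hair", "local"),
    ("age range", "global"),
]


def optimized_sequence(attributes):
    """Order attribute names from global to local (stable within a scope)."""
    ranked = sorted(attributes, key=lambda item: SCOPE_RANK[item[1]])
    return [name for name, _ in ranked]


print(optimized_sequence(ATTRIBUTES))
```

Because Python's sort is stable, attributes that share a scope keep their input order, so any finer ranking within a scope would have to be encoded in the rank table.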
2. Related Works
2.1. Pedestrian Attribute Recognition with Hand-Crafted Features
2.2. Pedestrian Attribute Recognition with Deep Learning
3. Proposed Method
3.1. Architecture of the Model
3.2. ConvLSTM in the Model
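The introduction describes ConvLSTM as retaining spatial information by using convolution in the input-to-state and state-to-state transitions. The following NumPy sketch shows one step of a generic ConvLSTM cell under simplifying assumptions: a single sample, zero-padded "same" convolutions, and no peephole terms. The function names, gate ordering, and tensor shapes are illustrative choices, not the authors' implementation.

```python
import numpy as np


def conv2d_same(x, w):
    """Zero-padded 'same' cross-correlation.

    x: (C_in, H, W) feature map; w: (C_out, C_in, k, k) with odd k.
    """
    c_out, _, k, _ = w.shape
    pad = k // 2
    _, height, width = x.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((c_out, height, width))
    for dy in range(k):
        for dx in range(k):
            patch = xp[:, dy:dy + height, dx:dx + width]
            out += np.einsum("oc,chw->ohw", w[:, :, dy, dx], patch)
    return out


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def convlstm_step(x, h, c, w_x, w_h, b):
    """One ConvLSTM step; h and c keep their (C_h, H, W) spatial layout.

    w_x: (4*C_h, C_in, k, k) input-to-state kernels.
    w_h: (4*C_h, C_h, k, k) state-to-state kernels.
    b:   (4*C_h,) biases. Assumed gate order: input, forget, output, candidate.
    """
    gates = conv2d_same(x, w_x) + conv2d_same(h, w_h) + b[:, None, None]
    i, f, o, g = np.split(gates, 4, axis=0)
    c_next = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_next = sigmoid(o) * np.tanh(c_next)
    return h_next, c_next


rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4, 5))                   # 3 input channels on a 4x5 grid
h0 = np.zeros((2, 4, 5))                         # 2 hidden channels
c0 = np.zeros((2, 4, 5))
w_x = rng.normal(scale=0.1, size=(8, 3, 3, 3))
w_h = rng.normal(scale=0.1, size=(8, 2, 3, 3))
h1, c1 = convlstm_step(x, h0, c0, w_x, w_h, np.zeros(8))
print(h1.shape, c1.shape)
```

Unlike a fully connected LSTM, the hidden and cell states here are feature maps rather than vectors, which is what lets the recurrence carry spatial layout from one predicted attribute to the next.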
3.3. CAtt in the Model
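According to the introduction, the CAtt weights are calculated from both the visual features and the hidden states of the ConvLSTM. One plausible way to realise this is a squeeze-and-excitation style gate, sketched below in NumPy. The pooling, the two-layer bottleneck, and the names `w1` and `w2` are assumptions made for illustration, and the paper's exact formulation may differ.

```python
import numpy as np


def channel_attention(feat, hidden, w1, w2):
    """Re-weight the channels of `feat` using `feat` and `hidden` together.

    feat:   (C, H, W) visual features from the CNN.
    hidden: (C_h, H, W) hidden state of the ConvLSTM.
    w1:     (r, C + C_h) bottleneck weights (hypothetical).
    w2:     (C, r) expansion weights (hypothetical).
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = np.concatenate([feat.mean(axis=(1, 2)), hidden.mean(axis=(1, 2))])
    # Excite: ReLU bottleneck followed by a sigmoid gate per feature channel.
    z = np.maximum(w1 @ squeezed, 0.0)
    weights = 1.0 / (1.0 + np.exp(-(w2 @ z)))
    # Scale: each feature channel is multiplied by its weight in (0, 1).
    return feat * weights[:, None, None], weights


rng = np.random.default_rng(0)
feat = rng.normal(size=(6, 4, 5))
hidden = rng.normal(size=(2, 4, 5))
w1 = rng.normal(size=(3, 8))
w2 = rng.normal(size=(6, 3))
reweighted, weights = channel_attention(feat, hidden, w1, w2)
print(reweighted.shape, weights.shape)
```

Because the hidden state enters the gate, the channel weights can change at every prediction step, so different attributes can emphasise different feature channels.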
3.4. Loss Function in the Model
4. Experiments and Discussions
4.1. Datasets
4.2. Evaluation Metrics
4.3. Implementation Details
4.4. Comparison with Other Methods
4.5. Further Analysis and Discussions
4.5.1. The Effect of ConvLSTM and CAtt
4.5.2. The Effect of Optimized Prediction Sequence
5. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
- Wang, X.; Zheng, S.; Yang, R.; Luo, B.; Tang, J. Pedestrian Attribute Recognition: A Survey. arXiv 2019, arXiv:1901.07474.
- Liu, Z.; Luo, P.; Wang, X.; Tang, X. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 3730–3738.
- Su, C.; Zhang, S.; Xing, J.; Gao, W.; Tian, Q. Deep attributes driven multi-camera person re-identification. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 475–491.
- Lin, Y.; Zheng, L.; Zheng, Z.; Wu, Y.; Hu, Z.; Yan, C.; Yang, Y. Improving person re-identification by attribute and identity learning. Patt. Recognit. 2019, 95, 151–161.
- Feris, R.; Bobbitt, R.; Brown, L.; Pankanti, S. Attribute-based people search: Lessons learnt from a practical surveillance system. In Proceedings of the ACM International Conference on Multimedia Retrieval, Glasgow, Scotland, 1–4 April 2014; p. 153.
- Wang, X.; Zhang, T.; Tretter, D.R.; Lin, Q. Personal clothing retrieval on photo collections by color and attributes. IEEE Trans. Multimed. 2013, 15, 2035–2045.
- Reid, D.; Nixon, M.; Stevenage, S. Soft biometrics; human identification using comparative descriptions. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 1216–1228.
- Li, D.; Chen, X.; Huang, K. Multi-attribute learning for pedestrian attribute recognition in surveillance scenarios. In Proceedings of the 3rd IAPR Asian Conference on Pattern Recognition (ACPR), Kuala Lumpur, Malaysia, 3–6 November 2015; pp. 111–115.
- Peng, P.; Tian, Y.; Xiang, T.; Wang, Y.; Pontil, M.; Huang, T. Joint semantic and latent attribute modelling for cross-class transfer learning. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 1625–1638.
- Zhu, J.; Liao, S.; Lei, Z.; Li, S. Multi-label convolutional neural network based pedestrian attribute classification. Image Vis. Comput. 2016, 58, 224–229.
- Wang, J.; Zhu, X.; Gong, S. Discovering visual concept structure with sparse and incomplete tags. Artif. Intell. 2017, 250, 16–36.
- Liu, X.; Zhao, H.; Tian, M.; Sheng, L.; Shao, J.; Yi, S.; Yan, J.; Wang, X. Hydraplus-net: Attentive deep features for pedestrian analysis. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 350–359.
- Zhou, Y.; Yu, K.; Leng, B.; Zhang, Z.; Li, D.; Huang, K. Weakly-supervised Learning of Mid-level Features for Pedestrian Attribute Recognition and Localization. In Proceedings of the British Machine Vision Conference (BMVC), London, UK, 4–7 September 2017; pp. 69.1–69.12.
- Li, Y.; Huang, C.; Loy, C.; Tang, X. Human Attribute Recognition by Deep Hierarchical Contexts. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 684–700.
- Li, Y.; Lin, G.; Zhuang, B.; Liu, L.; Shen, C.; Hengel, A. Sequential person recognition in photo albums with a recurrent network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1338–1346.
- Deng, Y.; Luo, P.; Loy, C.C.; Tang, X. Pedestrian attribute recognition at far distance. In Proceedings of the 22nd ACM International Conference on Multimedia, Orlando, FL, USA, 3–7 November 2014; pp. 789–792.
- Li, D.; Zhang, Z.; Chen, X.; Ling, H.; Huang, K. A richly annotated dataset for pedestrian attribute recognition. arXiv 2016, arXiv:1603.07054.
- Jaha, E.S.; Nixon, M.S. Soft biometrics for subject identification using clothing attributes. In Proceedings of the IEEE International Joint Conference on Biometrics, Clearwater, FL, USA, 29 September–2 October 2014; pp. 1–6.
- Chen, H.; Gallagher, A.; Girod, B. Describing clothing by semantic attributes. In Proceedings of the European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; pp. 609–623.
- Shi, Z.; Hospedales, T.M.; Xiang, T. Transferring a semantic representation for person re-identification and search. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4184–4193.
- Deng, Y.; Luo, P.; Loy, C.C.; Tang, X. Learning to recognize pedestrian attribute. arXiv 2015, arXiv:1501.00901.
- Gkioxari, G.; Girshick, R.; Malik, J. Contextual action recognition with R*CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1080–1088.
- Zhang, N.; Paluri, M.; Ranzato, M.; Darrell, T.; Bourdev, L. Panda: Pose aligned networks for deep attribute modeling. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 24–27 June 2014; pp. 1637–1644.
- Zhu, J.; Liao, S.; Yi, D.; Lei, Z.; Li, S.Z. Multi-label CNN based pedestrian attribute learning for soft biometrics. In Proceedings of the 2015 International Conference on Biometrics (ICB), Phuket, Thailand, 19–22 May 2015; pp. 535–540.
- Fabbri, M.; Calderara, S.; Cucchiara, R. Generative adversarial models for people attribute recognition in surveillance. In Proceedings of the 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Lecce, Italy, 29 August–1 September 2017; pp. 1–6.
- Wang, J.; Zhu, X.; Gong, S.; Li, W. Attribute recognition by joint recurrent learning of context and correlation. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 531–540.
- Sudowe, P.; Spitzer, H.; Leibe, B. Person attribute recognition with a jointly-trained holistic CNN model. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Santiago, Chile, 7–13 December 2015; pp. 87–95.
- Li, D.; Chen, X.; Zhang, Z.; Huang, K. Pose Guided Deep Model for Pedestrian Attribute Recognition in Surveillance Scenarios. In Proceedings of the 2018 IEEE International Conference on Multimedia and Expo (ICME), San Diego, CA, USA, 23–27 July 2018; pp. 1–6.
- Zhao, X.; Sang, L.; Ding, G.; Han, J.; Di, N.; Yan, C. Recurrent attention model for pedestrian attribute recognition. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019.
- Liu, H.; Wu, J.; Jiang, J.; Qi, M.; Bo, R. Sequence-based Person Attribute Recognition with Joint CTC-Attention Model. arXiv 2018, arXiv:1811.08115.
- Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780.
- Xingjian, S.; Chen, Z.; Wang, H.; Yeung, D.; Wong, W.; Woo, W. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–10 December 2015; pp. 802–810.
- Hu, J.; Shen, L.; Sun, G.; Wu, E. Squeeze-and-excitation networks. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41.
- Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 2818–2826.
- Liu, F.; Xiang, T.; Hospedales, T.M.; Yang, W.; Sun, C. Semantic regularisation for recurrent image annotation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2872–2880.

Method | PETA mA | PETA mP | PETA mR | PETA F1 | RAP mA | RAP mP | RAP mR | RAP F1
---|---|---|---|---|---|---|---|---
ACN [27] | 81.15 | 84.06 | 81.26 | 82.64 | 69.66 | 80.12 | 72.26 | 75.98
DeepMAR [8] | 81.50 | 89.70 | 81.90 | 85.68 | 76.10 | 82.20 | 74.80 | 78.30
HP-net [12] | 81.77 | 84.92 | 83.24 | 84.07 | 76.12 | 77.33 | 78.79 | 78.05
CTX [15] | 80.13 | 79.68 | 80.24 | 79.68 | 70.13 | 71.03 | 71.20 | 70.23
SR [35] | 82.83 | 82.54 | 82.76 | 82.65 | 74.21 | 75.11 | 76.52 | 75.83
JRL [26] | 85.67 | 86.03 | 85.34 | 85.42 | 77.81 | 78.11 | 78.98 | 78.58
RA [29] | 86.11 | 84.69 | 88.51 | 86.56 | 81.16 | 79.45 | 79.23 | 79.34
Ours | 88.56 | 88.32 | 89.62 | 88.97 | 83.72 | 81.85 | 79.96 | 80.89

Method | PETA mA | PETA mP | PETA mR | PETA F1 | RAP mA | RAP mP | RAP mR | RAP F1
---|---|---|---|---|---|---|---|---
MLCNN | 79.86 | 81.73 | 79.92 | 80.81 | 68.22 | 72.46 | 71.34 | 71.90
CNN-LSTM | 81.63 | 83.25 | 82.54 | 82.89 | 74.63 | 75.97 | 76.62 | 76.29
CNN-SAtt-LSTM | 85.13 | 85.75 | 84.95 | 85.35 | 77.49 | 77.85 | 78.32 | 78.08
CNN-ConvLSTM | 85.92 | 85.21 | 86.12 | 85.66 | 79.35 | 78.73 | 78.65 | 78.69
CNN-SAtt-ConvLSTM | 86.08 | 85.34 | 86.22 | 85.78 | 79.48 | 78.83 | 78.77 | 78.80
CNN-CAtt-ConvLSTM (Ours) | 88.56 | 88.32 | 89.62 | 88.97 | 83.72 | 81.85 | 79.96 | 80.89

Method | PETA mA | PETA mP | PETA mR | PETA F1 | RAP mA | RAP mP | RAP mR | RAP F1
---|---|---|---|---|---|---|---|---
Random sequence | 88.01 | 87.81 | 89.13 | 88.47 | 83.13 | 81.32 | 79.46 | 80.38
Optimized sequence | 88.56 | 88.32 | 89.62 | 88.97 | 83.72 | 81.85 | 79.96 | 80.89

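The F1 columns in these tables can be cross-checked against the mP and mR columns. The sketch below assumes that F1 is the harmonic mean of mP and mR, which the section does not state explicitly but which the reported numbers are consistent with. It uses the "Optimized sequence" row as input.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in percent here)."""
    return 2.0 * precision * recall / (precision + recall)


# (mP, mR, reported F1) taken from the "Optimized sequence" row.
ROWS = {
    "PETA": (88.32, 89.62, 88.97),
    "RAP": (81.85, 79.96, 80.89),
}

for dataset, (mp, mr, reported) in ROWS.items():
    computed = f1_score(mp, mr)
    print(f"{dataset}: computed F1 = {computed:.2f}, reported F1 = {reported:.2f}")
```

Both computed values agree with the reported ones to two decimal places, so the same check can be used to spot transcription errors in other rows.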
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Li, Y.; Xu, H.; Bian, M.; Xiao, J. Attention Based CNN-ConvLSTM for Pedestrian Attribute Recognition. Sensors 2020, 20, 811. https://doi.org/10.3390/s20030811