
Modality Distillation with Multiple Stream Networks for Action Recognition

Published: 08 September 2018

Abstract

Diverse input modalities can provide complementary cues for many tasks, usually leading to more robust algorithms and better performance. However, while a (training) dataset can be carefully designed to include a variety of sensory inputs, it is often the case that not all modalities are available in the real-life (testing) scenarios where a model is deployed. This raises the challenge of learning robust representations from multimodal data at training time while accounting for limitations at test time, such as noisy or missing modalities. This paper presents a new approach for multimodal video action recognition, developed within the unified framework of distillation and privileged information known as generalized distillation. In particular, we consider the case of learning representations from depth and RGB videos, while relying on RGB data only at test time. We propose a new approach to train a hallucination network that learns to distill depth features through multiplicative connections of spatiotemporal representations, leveraging both soft and hard labels as well as a distance between feature maps. We report state-of-the-art results on video action classification on the largest multimodal dataset available for this task, NTU RGB+D, as well as on the UWA3DII and Northwestern-UCLA datasets.
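The training objective sketched above combines three signals: a hard-label classification loss, a soft-label distillation loss from the depth (teacher) stream, and a distance between hallucinated and true depth feature maps. The following is a minimal PyTorch-style sketch of such a combined loss, written only to make that structure concrete; the function name, loss weights, temperature, and the choice of mean-squared error for the feature-map distance are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def generalized_distillation_loss(student_logits, teacher_logits,
                                  hallucinated_feat, depth_feat,
                                  labels, T=4.0, alpha=0.5, beta=1.0):
    """Hard labels + softened teacher labels + feature-map distance (sketch)."""
    # Hard-label term: cross-entropy against ground-truth action labels.
    hard = F.cross_entropy(student_logits, labels)
    # Soft-label term: KL divergence to the depth teacher's temperature-
    # softened predictions, scaled by T^2 as in Hinton et al.'s distillation.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    # Feature term: distance between the hallucination stream's feature maps
    # and the real depth stream's feature maps (MSE assumed for illustration).
    feat = F.mse_loss(hallucinated_feat, depth_feat)
    return hard + alpha * soft + beta * feat
```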



Published In

Computer Vision – ECCV 2018: 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part VIII
Sep 2018
845 pages
ISBN:978-3-030-01236-6
DOI:10.1007/978-3-030-01237-3

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 08 September 2018

Author Tags

  1. Action recognition
  2. Deep multimodal learning
  3. Distillation
  4. Privileged information

Qualifiers

  • Article


Bibliometrics

Article Metrics

  • Downloads (last 12 months): 0
  • Downloads (last 6 weeks): 0
Reflects downloads up to 08 Feb 2025

Cited By

  • (2024) Learning modality knowledge alignment for cross-modality transfer. Proceedings of the 41st International Conference on Machine Learning, pp. 33777-33793. DOI: 10.5555/3692070.3693442. Online publication date: 21-Jul-2024.
  • (2024) Visual attention prompted prediction and learning. Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, pp. 5517-5525. DOI: 10.24963/ijcai.2024/610. Online publication date: 3-Aug-2024.
  • (2024) MGR-Dark: A Large Multimodal Video Dataset and RGB-IR Benchmark for Gesture Recognition in Darkness. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 2321-2330. DOI: 10.1145/3664647.3681267. Online publication date: 28-Oct-2024.
  • (2024) A teacher–student deep learning strategy for extreme low resolution unsafe action recognition in construction projects. Advanced Engineering Informatics, 59:C. DOI: 10.1016/j.aei.2023.102294. Online publication date: 1-Jan-2024.
  • (2024) A human activity recognition framework in videos using segmented human subject focus. The Visual Computer: International Journal of Computer Graphics, 40:10, pp. 6983-6999. DOI: 10.1007/s00371-023-03256-4. Online publication date: 1-Oct-2024.
  • (2023) KD-zero. Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 69490-69504. DOI: 10.5555/3666122.3669165. Online publication date: 10-Dec-2023.
  • (2023) Auxiliary modality learning with generalized curriculum distillation. Proceedings of the 40th International Conference on Machine Learning, pp. 31057-31076. DOI: 10.5555/3618408.3619694. Online publication date: 23-Jul-2023.
  • (2023) On uni-modal feature learning in supervised multi-modal learning. Proceedings of the 40th International Conference on Machine Learning, pp. 8632-8656. DOI: 10.5555/3618408.3618753. Online publication date: 23-Jul-2023.
  • (2023) Egocentric Early Action Prediction via Adversarial Knowledge Distillation. ACM Transactions on Multimedia Computing, Communications, and Applications, 19:2, pp. 1-21. DOI: 10.1145/3544493. Online publication date: 6-Feb-2023.
  • (2022) Shadow knowledge distillation. Proceedings of the 36th International Conference on Neural Information Processing Systems, pp. 635-649. DOI: 10.5555/3600270.3600316. Online publication date: 28-Nov-2022.
