
Modality Distillation with Multiple Stream Networks for Action Recognition

Published: 08 September 2018

Abstract

Diverse input modalities can provide complementary cues for many tasks, usually leading to more robust algorithms and better performance. However, while a (training) dataset can be carefully designed to include a variety of sensory inputs, it is often the case that not all modalities are available in the real-life (testing) scenarios where a model is deployed. This raises the challenge of learning robust representations from multimodal data at training time while accounting for limitations at test time, such as noisy or missing modalities. This paper presents a new approach for multimodal video action recognition, developed within the unified framework of distillation and privileged information known as generalized distillation. In particular, we consider the case of learning representations from depth and RGB videos, while relying on RGB data only at test time. We propose a new approach to train a hallucination network that learns to distill depth features through multiplicative connections of spatiotemporal representations, leveraging both soft and hard labels as well as a distance between feature maps. We report state-of-the-art results on video action classification on the largest multimodal dataset available for this task, NTU RGB+D, as well as on the UWA3DII and Northwestern-UCLA datasets.
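The training objective sketched above combines three signals: a hard-label classification loss, a soft-label distillation loss from the depth (teacher) stream, and a distance between hallucinated and true depth feature maps. The following is a minimal PyTorch-style sketch of such a combined loss, written only to make that structure concrete; the function name, loss weights, temperature, and the choice of mean-squared error for the feature-map distance are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def generalized_distillation_loss(student_logits, teacher_logits,
                                  hallucinated_feat, depth_feat,
                                  labels, T=4.0, alpha=0.5, beta=1.0):
    """Hard labels + softened teacher labels + feature-map distance (sketch)."""
    # Hard-label term: cross-entropy against ground-truth action labels.
    hard = F.cross_entropy(student_logits, labels)
    # Soft-label term: KL divergence to the depth teacher's temperature-
    # softened predictions, scaled by T^2 as in Hinton et al.'s distillation.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    # Feature term: distance between the hallucination stream's feature maps
    # and the real depth stream's feature maps (MSE assumed for illustration).
    feat = F.mse_loss(hallucinated_feat, depth_feat)
    return hard + alpha * soft + beta * feat
```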



Published In

Computer Vision – ECCV 2018: 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part VIII
Sep 2018
845 pages
ISBN:978-3-030-01236-6
DOI:10.1007/978-3-030-01237-3

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 08 September 2018

Author Tags

  1. Action recognition
  2. Deep multimodal learning
  3. Distillation
  4. Privileged information

Qualifiers

  • Article


Bibliometrics

Article Metrics

  • Downloads (last 12 months): 0
  • Downloads (last 6 weeks): 0
Reflects downloads up to 08 Feb 2025

Cited By

  • (2024) Learning modality knowledge alignment for cross-modality transfer. Proceedings of the 41st International Conference on Machine Learning, pp. 33777-33793. DOI: 10.5555/3692070.3693442. Online publication date: 21-Jul-2024.
  • (2024) Visual attention prompted prediction and learning. Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, pp. 5517-5525. DOI: 10.24963/ijcai.2024/610. Online publication date: 3-Aug-2024.
  • (2024) MGR-Dark: A Large Multimodal Video Dataset and RGB-IR Benchmark for Gesture Recognition in Darkness. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 2321-2330. DOI: 10.1145/3664647.3681267. Online publication date: 28-Oct-2024.
  • (2024) A teacher–student deep learning strategy for extreme low resolution unsafe action recognition in construction projects. Advanced Engineering Informatics, 59:C. DOI: 10.1016/j.aei.2023.102294. Online publication date: 1-Jan-2024.
  • (2024) A human activity recognition framework in videos using segmented human subject focus. The Visual Computer: International Journal of Computer Graphics, 40:10, pp. 6983-6999. DOI: 10.1007/s00371-023-03256-4. Online publication date: 1-Oct-2024.
  • (2023) KD-zero. Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 69490-69504. DOI: 10.5555/3666122.3669165. Online publication date: 10-Dec-2023.
  • (2023) Auxiliary modality learning with generalized curriculum distillation. Proceedings of the 40th International Conference on Machine Learning, pp. 31057-31076. DOI: 10.5555/3618408.3619694. Online publication date: 23-Jul-2023.
  • (2023) On uni-modal feature learning in supervised multi-modal learning. Proceedings of the 40th International Conference on Machine Learning, pp. 8632-8656. DOI: 10.5555/3618408.3618753. Online publication date: 23-Jul-2023.
  • (2023) Egocentric Early Action Prediction via Adversarial Knowledge Distillation. ACM Transactions on Multimedia Computing, Communications, and Applications, 19:2, pp. 1-21. DOI: 10.1145/3544493. Online publication date: 6-Feb-2023.
  • (2022) Shadow knowledge distillation. Proceedings of the 36th International Conference on Neural Information Processing Systems, pp. 635-649. DOI: 10.5555/3600270.3600316. Online publication date: 28-Nov-2022.
