DOI: 10.1145/2964284.2967222

Improved Dense Trajectory with Cross Streams

Published: 01 October 2016

Abstract

Improved dense trajectories (iDT) have shown great performance in action recognition, and their combination with the two-stream approach has achieved state-of-the-art performance. It is, however, difficult for iDT to completely remove background trajectories from videos with camera shake. Trajectories in less discriminative regions should be given modest weights so that more discriminative local descriptors can be created for action recognition. In addition, the two-stream approach, which learns appearance and motion information separately, cannot focus on motion in important regions when extracting features from the spatial convolutional layers of the appearance network, and vice versa. To address these problems, we propose a new local descriptor that pools a new convolutional layer, obtained by crossing the two networks, along iDT. This descriptor is calculated by applying discriminative weights learned from one network to a convolutional layer of the other network. Our method achieves state-of-the-art performance on standard action recognition datasets: 92.3% on UCF101 and 66.2% on HMDB51.
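The core idea, pooling one stream's convolutional features along a trajectory while weighting each point by discriminative weights from the other stream, can be sketched as follows. The function name, array shapes, and the simple weighted average are illustrative assumptions for a minimal sketch, not the authors' implementation:

```python
import numpy as np

def cross_stream_descriptor(feat_map, weight_map, trajectory, eps=1e-8):
    """Weighted trajectory pooling across two streams (illustrative sketch).

    feat_map   : (H, W, C) convolutional feature map from one stream
                 (e.g. the appearance network).
    weight_map : (H, W) discriminative weights derived from the *other*
                 stream (e.g. the motion network), assumed in [0, 1].
    trajectory : sequence of (y, x) integer points tracked over frames.
    Returns a C-dimensional descriptor: the weighted average of features
    pooled along the trajectory.
    """
    desc = np.zeros(feat_map.shape[-1])
    total = 0.0
    for y, x in trajectory:
        w = float(weight_map[y, x])     # cross-stream weight at this point
        desc += w * feat_map[y, x]      # weight the other stream's features
        total += w
    return desc / max(total, eps)       # normalize; eps guards empty weights

# Toy example: an 8x8 map with 4 channels and a short 3-point trajectory.
rng = np.random.default_rng(0)
feats = rng.random((8, 8, 4))
weights = np.zeros((8, 8))
weights[2, 2] = 1.0                     # only one point is "discriminative"
traj = [(2, 2), (5, 5), (6, 1)]
d = cross_stream_descriptor(feats, weights, traj)
# With all weight on (2, 2), the descriptor equals feats[2, 2].
assert np.allclose(d, feats[2, 2])
```

In this sketch, trajectory points in low-weight (background) regions contribute little to the descriptor, which is how modest weights can suppress background trajectories that camera shake leaves behind.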


Published In

MM '16: Proceedings of the 24th ACM international conference on Multimedia
October 2016
1542 pages
ISBN:9781450336031
DOI:10.1145/2964284
Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. action recognition
  2. local descriptor
  3. video representation

Qualifiers

  • Short-paper

Funding Sources

  • JST

Conference

MM '16: ACM Multimedia Conference
October 15 - 19, 2016
Amsterdam, The Netherlands

Acceptance Rates

MM '16 Paper Acceptance Rate: 52 of 237 submissions, 22%
Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%


Cited By

  • (2024) Pose-Promote: Progressive Visual Perception for Activities of Daily Living. IEEE Signal Processing Letters, 31:2950-2954. DOI: 10.1109/LSP.2024.3480046
  • (2024) 3D Graph Convolutional Feature Selection and Dense Pre-Estimation for Skeleton Action Recognition. IEEE Access, 12:11733-11742. DOI: 10.1109/ACCESS.2024.3353622
  • (2024) 3sG: Three-stage guidance for indoor human action recognition. IET Image Processing, 18(8):2000-2010. DOI: 10.1049/ipr2.13078
  • (2023) Multimodal action recognition: a comprehensive survey on temporal modeling. Multimedia Tools and Applications, 83(20):59439-59489. DOI: 10.1007/s11042-023-17345-y
  • (2022) Self-supervised Video Representation Learning with Cross-Stream Prototypical Contrasting. In WACV 2022, pp. 846-856. DOI: 10.1109/WACV51458.2022.00092
  • (2019) Feature Fusion of Deep Spatial Features and Handcrafted Spatiotemporal Features for Human Action Recognition. Sensors, 19(7):1599. DOI: 10.3390/s19071599
  • (2019) Loss Switching Fusion with Similarity Search for Video Classification. In ICIP 2019, pp. 974-978. DOI: 10.1109/ICIP.2019.8803051
  • (2019) Cross-Stream Selective Networks for Action Recognition. In CVPRW 2019, pp. 454-460. DOI: 10.1109/CVPRW.2019.00059
  • (2018) Action Recognition by Jointly Using Video Proposal and Trajectory. In Proceedings of the 2nd International Conference on Vision, Image and Signal Processing, pp. 1-7. DOI: 10.1145/3271553.3271563
  • (2018) Complex Behavior Recognition Based on Convolutional Neural Network: A Survey. In MSN 2018, pp. 103-108. DOI: 10.1109/MSN.2018.00024