Temporal Aggregate Representations for Long-Range Video Understanding

Published: 23 August 2020

Abstract

    Future prediction, especially in long-range videos, requires reasoning from current and past observations. In this work, we address questions of temporal extent, scaling, and level of semantic abstraction with a flexible multi-granular temporal aggregation framework. We show that state-of-the-art performance in both next-action and dense anticipation can be achieved with simple techniques such as max-pooling and attention. To demonstrate the anticipation capabilities of our model, we conduct experiments on the Breakfast, 50Salads, and EPIC-Kitchens datasets, where we achieve state-of-the-art results. With minimal modifications, our model can also be extended to video segmentation and action recognition.
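    The paper's actual architecture is more elaborate than its abstract can convey, but the core idea — summarizing frame features at several temporal granularities with max-pooling, then fusing the per-scale summaries with attention — can be sketched in a few lines. The function name, the choice of pooling scales, and the use of the most recent frame as the attention query below are illustrative assumptions, not the authors' exact design:

    ```python
    import numpy as np

    def softmax(x, axis=-1):
        # Numerically stable softmax.
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def multi_granular_aggregate(frames, scales=(5, 10, 20)):
        """Max-pool per-frame features over several temporal granularities,
        then fuse the per-scale summaries with dot-product attention.

        frames: (T, D) array of per-frame features.
        Returns a single (D,) aggregate representation.
        """
        T, D = frames.shape
        summaries = []
        for s in scales:
            # Split the sequence into chunks of length s and max-pool each chunk.
            chunks = [frames[i:i + s] for i in range(0, T, s)]
            pooled = np.stack([c.max(axis=0) for c in chunks])  # (num_chunks, D)
            # Collapse the chunks of this granularity into one summary vector.
            summaries.append(pooled.mean(axis=0))
        S = np.stack(summaries)  # (num_scales, D)
        # Attend over scales, using the most recent observation as the query.
        weights = softmax(S @ frames[-1] / np.sqrt(D))  # (num_scales,)
        return weights @ S  # (D,)
    ```

    Coarse scales capture long-range context while fine scales keep recent detail; the attention weights let the model decide, per input, which granularity matters most for anticipating the next action.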


    Published In

    Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVI
    Aug 2020
    845 pages
    ISBN: 978-3-030-58516-7
    DOI: 10.1007/978-3-030-58517-4

    Publisher

    Springer-Verlag

    Berlin, Heidelberg


    Author Tags

    1. Action anticipation
    2. Temporal aggregation

    Qualifiers

    • Article

    Cited By

    • (2024) Toward Egocentric Compositional Action Anticipation with Adaptive Semantic Debiasing. ACM Transactions on Multimedia Computing, Communications, and Applications 20(5), 1–21. https://doi.org/10.1145/3633333. Online publication date: 11-Jan-2024
    • (2022) ConTra: (Con)text (Tra)nsformer for Cross-Modal Video Retrieval. Computer Vision – ACCV 2022, pp. 451–468. https://doi.org/10.1007/978-3-031-26316-3_27. Online publication date: 4-Dec-2022
    • (2022) Graphing the Future: Activity and Next Active Object Prediction Using Graph-Based Activity Representations. Advances in Visual Computing, pp. 299–312. https://doi.org/10.1007/978-3-031-20713-6_23. Online publication date: 3-Oct-2022
    • (2022) Real-Time Online Video Detection with Temporal Smoothing Transformers. Computer Vision – ECCV 2022, pp. 485–502. https://doi.org/10.1007/978-3-031-19830-4_28. Online publication date: 23-Oct-2022
    • (2022) A Generalized and Robust Framework for Timestamp Supervision in Temporal Action Segmentation. Computer Vision – ECCV 2022, pp. 279–296. https://doi.org/10.1007/978-3-031-19772-7_17. Online publication date: 23-Oct-2022
