Temporal Aggregate Representations for Long-Range Video Understanding

Published: 23 August 2020

Abstract

    Future prediction, especially in long-range videos, requires reasoning from current and past observations. In this work, we address questions of temporal extent, scaling, and level of semantic abstraction with a flexible multi-granular temporal aggregation framework. We show that state-of-the-art performance in both next-action and dense anticipation can be achieved with simple techniques such as max-pooling and attention. To demonstrate the anticipation capabilities of our model, we conduct experiments on the Breakfast, 50Salads, and EPIC-Kitchens datasets, where we achieve state-of-the-art results. With minimal modifications, our model can also be extended to video segmentation and action recognition.
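    The paper's actual architecture is more elaborate than its abstract can convey, but the core idea — summarizing frame features at several temporal granularities with max-pooling, then fusing the per-scale summaries with attention — can be sketched in a few lines. The function name, the choice of pooling scales, and the use of the most recent frame as the attention query below are illustrative assumptions, not the authors' exact design:

    ```python
    import numpy as np

    def softmax(x, axis=-1):
        # Numerically stable softmax.
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def multi_granular_aggregate(frames, scales=(5, 10, 20)):
        """Max-pool per-frame features over several temporal granularities,
        then fuse the per-scale summaries with dot-product attention.

        frames: (T, D) array of per-frame features.
        Returns a single (D,) aggregate representation.
        """
        T, D = frames.shape
        summaries = []
        for s in scales:
            # Split the sequence into chunks of length s and max-pool each chunk.
            chunks = [frames[i:i + s] for i in range(0, T, s)]
            pooled = np.stack([c.max(axis=0) for c in chunks])  # (num_chunks, D)
            # Collapse the chunks of this granularity into one summary vector.
            summaries.append(pooled.mean(axis=0))
        S = np.stack(summaries)  # (num_scales, D)
        # Attend over scales, using the most recent observation as the query.
        weights = softmax(S @ frames[-1] / np.sqrt(D))  # (num_scales,)
        return weights @ S  # (D,)
    ```

    Coarse scales capture long-range context while fine scales keep recent detail; the attention weights let the model decide, per input, which granularity matters most for anticipating the next action.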


    Published In

    Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVI
    Aug 2020
    845 pages
    ISBN: 978-3-030-58516-7
    DOI: 10.1007/978-3-030-58517-4

    Publisher

    Springer-Verlag

    Berlin, Heidelberg


    Author Tags

    1. Action anticipation
    2. Temporal aggregation

    Qualifiers

    • Article

    Cited By

    • (2024) Toward Egocentric Compositional Action Anticipation with Adaptive Semantic Debiasing. ACM Transactions on Multimedia Computing, Communications, and Applications 20(5), 1–21. https://doi.org/10.1145/3633333. Online publication date: 11-Jan-2024
    • (2022) ConTra: (Con)text (Tra)nsformer for Cross-Modal Video Retrieval. Computer Vision – ACCV 2022, pp. 451–468. https://doi.org/10.1007/978-3-031-26316-3_27. Online publication date: 4-Dec-2022
    • (2022) Graphing the Future: Activity and Next Active Object Prediction Using Graph-Based Activity Representations. Advances in Visual Computing, pp. 299–312. https://doi.org/10.1007/978-3-031-20713-6_23. Online publication date: 3-Oct-2022
    • (2022) Real-Time Online Video Detection with Temporal Smoothing Transformers. Computer Vision – ECCV 2022, pp. 485–502. https://doi.org/10.1007/978-3-031-19830-4_28. Online publication date: 23-Oct-2022
    • (2022) A Generalized and Robust Framework for Timestamp Supervision in Temporal Action Segmentation. Computer Vision – ECCV 2022, pp. 279–296. https://doi.org/10.1007/978-3-031-19772-7_17. Online publication date: 23-Oct-2022
