DOI: 10.1007/978-3-030-58452-8_41
Article

Forecasting Human-Object Interaction: Joint Prediction of Motor Attention and Actions in First Person Video

Published: 23 August 2020

Abstract

We address the challenging task of anticipating human-object interaction in first person videos. Most existing methods either ignore how the camera wearer interacts with objects, or simply consider body motion as a separate modality. In contrast, we observe that intentional hand movement reveals critical information about the future activity. Motivated by this observation, we adopt intentional hand movement as a feature representation and propose a novel deep network that jointly models and predicts egocentric hand motion, interaction hotspots, and future actions. Specifically, we treat the future hand motion as motor attention and model this attention with probabilistic variables in our deep model. The predicted motor attention is then used to select discriminative spatio-temporal visual features for predicting actions and interaction hotspots. We present extensive experiments demonstrating the benefit of the proposed joint model. Importantly, our model produces new state-of-the-art results for action anticipation on both the EGTEA Gaze+ and EPIC-Kitchens datasets. Our project page is available at https://aptx4869lm.github.io/ForecastingHOI/.
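The abstract outlines the core mechanism: predict a motor-attention map over the scene, treat it as a probabilistic variable, and use it to weight spatio-temporal features for action anticipation and interaction-hotspot prediction. The sketch below is an illustrative PyTorch rendering of that idea only, not the authors' released architecture; the layer sizes, the Gumbel-Softmax relaxation, and the names (MotorAttentionHead, num_actions) are assumptions.

```python
# Illustrative sketch only: attention-weighted pooling of 3D CNN features for
# joint action anticipation and hotspot prediction. All design choices here
# are assumptions made for exposition, not the paper's actual model.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MotorAttentionHead(nn.Module):
    def __init__(self, in_channels=1024, num_actions=106):
        super().__init__()
        # 1x1x1 conv scores every spatio-temporal location (attention logits)
        self.att_conv = nn.Conv3d(in_channels, 1, kernel_size=1)
        # future-action classifier over attention-pooled features
        self.action_fc = nn.Linear(in_channels, num_actions)
        # per-location interaction-hotspot logits from attention-modulated features
        self.hotspot_conv = nn.Conv3d(in_channels, 1, kernel_size=1)

    def forward(self, feats, tau=1.0):
        # feats: (B, C, T, H, W) clip features from a 3D CNN backbone
        B, C, T, H, W = feats.shape
        logits = self.att_conv(feats).view(B, -1)          # (B, T*H*W)
        if self.training:
            # stochastic ("probabilistic") attention via the Gumbel-Softmax relaxation
            att = F.gumbel_softmax(logits, tau=tau, hard=False, dim=-1)
        else:
            att = F.softmax(logits, dim=-1)
        att_map = att.view(B, 1, T, H, W)                  # motor-attention map
        # attention-weighted pooling selects discriminative locations
        pooled = (feats * att_map).sum(dim=(2, 3, 4))      # (B, C)
        action_logits = self.action_fc(pooled)             # future-action scores
        hotspots = torch.sigmoid(self.hotspot_conv(feats * att_map))
        return att_map, action_logits, hotspots
```

As a usage sketch, calling the module on a dummy clip, e.g. `MotorAttentionHead()(torch.randn(2, 1024, 4, 7, 7))`, returns the attention map, the action logits, and a hotspot probability map of the same spatio-temporal size as the input features.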



Information & Contributors

Published In

Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I
Aug 2020
855 pages
ISBN: 978-3-030-58451-1
DOI: 10.1007/978-3-030-58452-8

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 23 August 2020

Author Tags

  1. First Person Vision
  2. Action anticipation
  3. Motor attention

Qualifiers

  • Article


Cited By

  • (2024) Toward Egocentric Compositional Action Anticipation with Adaptive Semantic Debiasing. ACM Transactions on Multimedia Computing, Communications, and Applications 20(5), 1-21. DOI: 10.1145/3633333. Online publication date: 11-Jan-2024
  • (2024) AFF-ttention! Affordances and Attention Models for Short-Term Object Interaction Anticipation. Computer Vision – ECCV 2024, 167-184. DOI: 10.1007/978-3-031-73337-6_10. Online publication date: 29-Sep-2024
  • (2024) Bidirectional Progressive Transformer for Interaction Intention Anticipation. Computer Vision – ECCV 2024, 57-75. DOI: 10.1007/978-3-031-73202-7_4. Online publication date: 29-Sep-2024
  • (2024) Gated Temporal Diffusion for Stochastic Long-Term Dense Anticipation. Computer Vision – ECCV 2024, 454-472. DOI: 10.1007/978-3-031-73001-6_26. Online publication date: 29-Sep-2024
  • (2024) Ex2Eg-MAE: A Framework for Adaptation of Exocentric Video Masked Autoencoders for Egocentric Social Role Understanding. Computer Vision – ECCV 2024, 1-19. DOI: 10.1007/978-3-031-72989-8_1. Online publication date: 29-Sep-2024
  • (2024) Early Anticipation of Driving Maneuvers. Computer Vision – ECCV 2024, 152-169. DOI: 10.1007/978-3-031-72897-6_9. Online publication date: 29-Sep-2024
  • (2024) Spherical World-Locking for Audio-Visual Localization in Egocentric Videos. Computer Vision – ECCV 2024, 256-274. DOI: 10.1007/978-3-031-72691-0_15. Online publication date: 29-Sep-2024
  • (2024) LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning. Computer Vision – ECCV 2024, 135-155. DOI: 10.1007/978-3-031-72673-6_8. Online publication date: 29-Sep-2024
  • (2023) Iterative Learning with Extra and Inner Knowledge for Long-tail Dynamic Scene Graph Generation. Proceedings of the 31st ACM International Conference on Multimedia, 4707-4715. DOI: 10.1145/3581783.3612430. Online publication date: 26-Oct-2023
  • (2022) Generative Adversarial Network for Future Hand Segmentation from Egocentric Video. Computer Vision – ECCV 2022, 639-656. DOI: 10.1007/978-3-031-19778-9_37. Online publication date: 23-Oct-2022
