DOI: 10.1007/978-3-030-58452-8_41
Article

Forecasting Human-Object Interaction: Joint Prediction of Motor Attention and Actions in First Person Video

Published: 23 August 2020

Abstract

We address the challenging task of anticipating human-object interaction in first person videos. Most existing methods either ignore how the camera wearer interacts with objects, or simply consider body motion as a separate modality. In contrast, we observe that intentional hand movement reveals critical information about the future activity. Motivated by this observation, we adopt intentional hand movement as a feature representation and propose a novel deep network that jointly models and predicts egocentric hand motion, interaction hotspots, and future actions. Specifically, we treat the future hand motion as motor attention and model this attention with probabilistic variables in our deep model. The predicted motor attention is then used to select discriminative spatio-temporal visual features for predicting actions and interaction hotspots. We present extensive experiments demonstrating the benefit of the proposed joint model. Importantly, our model produces new state-of-the-art results for action anticipation on both the EGTEA Gaze+ and EPIC-Kitchens datasets. Our project page is available at https://aptx4869lm.github.io/ForecastingHOI/.
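The abstract outlines the core mechanism: predict a motor-attention map over the scene, treat it as a probabilistic variable, and use it to weight spatio-temporal features for action anticipation and interaction-hotspot prediction. The sketch below is an illustrative PyTorch rendering of that idea only, not the authors' released architecture; the layer sizes, the Gumbel-Softmax relaxation, and the names (MotorAttentionHead, num_actions) are assumptions.

```python
# Illustrative sketch only: attention-weighted pooling of 3D CNN features for
# joint action anticipation and hotspot prediction. All design choices here
# are assumptions made for exposition, not the paper's actual model.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MotorAttentionHead(nn.Module):
    def __init__(self, in_channels=1024, num_actions=106):
        super().__init__()
        # 1x1x1 conv scores every spatio-temporal location (attention logits)
        self.att_conv = nn.Conv3d(in_channels, 1, kernel_size=1)
        # future-action classifier over attention-pooled features
        self.action_fc = nn.Linear(in_channels, num_actions)
        # per-location interaction-hotspot logits from attention-modulated features
        self.hotspot_conv = nn.Conv3d(in_channels, 1, kernel_size=1)

    def forward(self, feats, tau=1.0):
        # feats: (B, C, T, H, W) clip features from a 3D CNN backbone
        B, C, T, H, W = feats.shape
        logits = self.att_conv(feats).view(B, -1)          # (B, T*H*W)
        if self.training:
            # stochastic ("probabilistic") attention via the Gumbel-Softmax relaxation
            att = F.gumbel_softmax(logits, tau=tau, hard=False, dim=-1)
        else:
            att = F.softmax(logits, dim=-1)
        att_map = att.view(B, 1, T, H, W)                  # motor-attention map
        # attention-weighted pooling selects discriminative locations
        pooled = (feats * att_map).sum(dim=(2, 3, 4))      # (B, C)
        action_logits = self.action_fc(pooled)             # future-action scores
        hotspots = torch.sigmoid(self.hotspot_conv(feats * att_map))
        return att_map, action_logits, hotspots
```

As a usage sketch, calling the module on a dummy clip, e.g. `MotorAttentionHead()(torch.randn(2, 1024, 4, 7, 7))`, returns the attention map, the action logits, and a hotspot probability map of the same spatio-temporal size as the input features.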



Information & Contributors

Published In

Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I
Aug 2020
855 pages
ISBN: 978-3-030-58451-1
DOI: 10.1007/978-3-030-58452-8

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 23 August 2020

Author Tags

  1. First Person Vision
  2. Action anticipation
  3. Motor attention

Qualifiers

  • Article


Cited By

  • (2024) Toward Egocentric Compositional Action Anticipation with Adaptive Semantic Debiasing. ACM Transactions on Multimedia Computing, Communications, and Applications 20(5), 1-21. DOI: 10.1145/3633333. Online publication date: 11-Jan-2024
  • (2024) AFF-ttention! Affordances and Attention Models for Short-Term Object Interaction Anticipation. Computer Vision – ECCV 2024, 167-184. DOI: 10.1007/978-3-031-73337-6_10. Online publication date: 29-Sep-2024
  • (2024) Bidirectional Progressive Transformer for Interaction Intention Anticipation. Computer Vision – ECCV 2024, 57-75. DOI: 10.1007/978-3-031-73202-7_4. Online publication date: 29-Sep-2024
  • (2024) Gated Temporal Diffusion for Stochastic Long-Term Dense Anticipation. Computer Vision – ECCV 2024, 454-472. DOI: 10.1007/978-3-031-73001-6_26. Online publication date: 29-Sep-2024
  • (2024) Ex2Eg-MAE: A Framework for Adaptation of Exocentric Video Masked Autoencoders for Egocentric Social Role Understanding. Computer Vision – ECCV 2024, 1-19. DOI: 10.1007/978-3-031-72989-8_1. Online publication date: 29-Sep-2024
  • (2024) Early Anticipation of Driving Maneuvers. Computer Vision – ECCV 2024, 152-169. DOI: 10.1007/978-3-031-72897-6_9. Online publication date: 29-Sep-2024
  • (2024) Spherical World-Locking for Audio-Visual Localization in Egocentric Videos. Computer Vision – ECCV 2024, 256-274. DOI: 10.1007/978-3-031-72691-0_15. Online publication date: 29-Sep-2024
  • (2024) LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning. Computer Vision – ECCV 2024, 135-155. DOI: 10.1007/978-3-031-72673-6_8. Online publication date: 29-Sep-2024
  • (2023) Iterative Learning with Extra and Inner Knowledge for Long-tail Dynamic Scene Graph Generation. Proceedings of the 31st ACM International Conference on Multimedia, 4707-4715. DOI: 10.1145/3581783.3612430. Online publication date: 26-Oct-2023
  • (2022) Generative Adversarial Network for Future Hand Segmentation from Egocentric Video. Computer Vision – ECCV 2022, 639-656. DOI: 10.1007/978-3-031-19778-9_37. Online publication date: 23-Oct-2022
