DOI: 10.1145/3517031.3529628

Can Gaze Inform Egocentric Action Recognition?

Published: 08 June 2022

Abstract

We investigate the hypothesis that the gaze signal can improve egocentric action recognition on the standard benchmark, the EGTEA Gaze++ dataset. In contrast to prior work, where the gaze signal was used only during training, we formulate a novel neural fusion approach, Cross-modality Attention Blocks (CMA), to leverage the gaze signal for action recognition during inference as well. CMA combines information from different modalities at different levels of abstraction to achieve state-of-the-art performance for egocentric action recognition. Specifically, fusing the video stream with optical flow via CMA outperforms the current state of the art by 3%. However, when CMA is employed to fuse the gaze signal with the video stream, no improvements are observed. Further investigation of this counter-intuitive finding indicates that the small spatial overlap between the network's attention map and the gaze ground truth renders the gaze signal uninformative for this benchmark. Based on our empirical findings, we recommend improvements to the current benchmark toward developing practical systems for egocentric video understanding with the gaze signal.
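
The CMA module described above is a cross-attention fusion of two modality streams. As a rough illustration of that idea, the sketch below implements a generic bidirectional cross-modality attention block in PyTorch, where each stream queries the other and the results are merged with residual connections. This is a minimal sketch assuming a standard transformer-style cross-attention formulation; the class name, layer placement, normalization scheme, and all hyperparameters are illustrative assumptions, not the paper's exact CMA design.

```python
# Minimal sketch of a bidirectional cross-modality attention fusion block.
# Illustrative only: module names, layer choices, and hyperparameters are
# assumptions and do not reproduce the paper's exact CMA architecture.
import torch
import torch.nn as nn


class CrossModalityAttentionBlock(nn.Module):
    """Fuse two modality token streams with cross-attention in both directions."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn_a_to_b = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_b_to_a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_b = nn.LayerNorm(dim)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor):
        # feat_a, feat_b: (batch, tokens, dim) features from two modalities,
        # e.g. flattened spatio-temporal features from the video stream and
        # from optical flow (or a gaze-derived attention map).
        a_from_b, _ = self.attn_a_to_b(query=feat_a, key=feat_b, value=feat_b)
        b_from_a, _ = self.attn_b_to_a(query=feat_b, key=feat_a, value=feat_a)
        # Residual connections preserve each stream's own information.
        return self.norm_a(feat_a + a_from_b), self.norm_b(feat_b + b_from_a)


if __name__ == "__main__":
    video_tokens = torch.randn(2, 196, 512)  # hypothetical video-stream tokens
    flow_tokens = torch.randn(2, 196, 512)   # hypothetical optical-flow tokens
    fused_video, fused_flow = CrossModalityAttentionBlock()(video_tokens, flow_tokens)
    print(fused_video.shape, fused_flow.shape)  # torch.Size([2, 196, 512]) twice
```

One design note on this sketch: attending in both directions lets each modality borrow complementary cues from the other before classification, which is consistent with the abstract's description of combining modalities at different levels of abstraction, though where such blocks sit in the backbone is not specified here.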

Supplemental Material

MP4 File
Conference presentation (ETRA Short Papers) of "Can Gaze Inform Egocentric Action Recognition?" by Zehua Zhang, David Crandall, Michael Proulx, Sachin Talathi, and Abhishek Sharma. DOI: 10.1145/3517031.3529628



Published In

ETRA '22: 2022 Symposium on Eye Tracking Research and Applications
June 2022
408 pages
ISBN: 9781450392525
DOI: 10.1145/3517031
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 June 2022


Author Tags

  1. attention
  2. deep neural networks
  3. egocentric action recognition
  4. gaze

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Data Availability

Conference presentation (ETRA Short Papers) of "Can Gaze Inform Egocentric Action Recognition?" by Zehua Zhang, David Crandall, Michael Proulx, Sachin Talathi, and Abhishek Sharma (DOI: 10.1145/3517031.3529628): https://dl.acm.org/doi/10.1145/3517031.3529628#1019_full_talk_rong_yao - Yao Rong.mp4

Conference

ETRA '22

Acceptance Rates

ETRA '22 Paper Acceptance Rate: 15 of 39 submissions (38%)
Overall Acceptance Rate: 69 of 137 submissions (50%)


Article Metrics

  • Downloads (Last 12 months)54
  • Downloads (Last 6 weeks)6
Reflects downloads up to 01 Nov 2024


Cited By

  • (2024) Real-World Scanpaths Exhibit Long-Term Temporal Dependencies: Considerations for Contextual AI for AR Applications. Proceedings of the 2024 Symposium on Eye Tracking Research and Applications, 1-7. DOI: 10.1145/3649902.3656352. Online publication date: 4-Jun-2024.
  • (2024) EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 22072-22086. DOI: 10.1109/CVPR52733.2024.02084. Online publication date: 16-Jun-2024.
  • (2023) EgoHumans: An Egocentric 3D Multi-Human Benchmark. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 19750-19762. DOI: 10.1109/ICCV51070.2023.01814. Online publication date: 1-Oct-2023.
  • (2023) EgoPCA: A New Framework for Egocentric Hand-Object Interaction Understanding. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 5250-5261. DOI: 10.1109/ICCV51070.2023.00486. Online publication date: 1-Oct-2023.
  • (2022) EgoBody: Human Body Shape and Motion of Interacting People from Head-Mounted Devices. Computer Vision – ECCV 2022, 180-200. DOI: 10.1007/978-3-031-20068-7_11. Online publication date: 23-Oct-2022.
