Depth Matters: Spatial Proximity-based Gaze Cone Generation for Gaze Following in Wild

Abstract

Gaze following aims to predict where a person is looking in a scene. Existing methods tend to prioritize traditional 2D RGB visual cues, or they require burdensome prior knowledge and additional, expensive datasets annotated in 3D coordinate systems to train specialized modules for scene modeling. In this work, we introduce a novel framework built on a simple ResNet backbone that uses only images and depth maps to mimic human visual preferences and achieve 3D-like depth perception. We first leverage depth maps to encode spatial proximity between scene objects and the target person. This sharpens the focus of the gaze cone on the region of interest pertaining to the target while diminishing the impact of surrounding distractions. To capture how the saliency gaze cone depends on diverse scene context, we then introduce a learnable grid-level regularized attention that anticipates coarse-grained regions of interest, thereby refining the mapping of the saliency feature to pixel-level heatmaps. This allows our model to better account for individual differences when predicting others’ gaze locations. Finally, we employ a KL-divergence loss to supervise the grid-level regularized attention, combining it with the gaze direction, heatmap regression, and in/out classification losses to provide comprehensive supervision for model optimization. Experimental results on two publicly available datasets demonstrate that our model achieves comparable performance while relying on less modal information. Visualization results further validate the interpretability of our method. The source code will be available at https://github.com/VUT-HFUT/DepthMatters.
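
The abstract describes two components concretely enough to sketch: a gaze cone attenuated by depth-based spatial proximity between the target person and scene points, and a combined objective in which a KL-divergence term supervises the grid-level regularized attention alongside the gaze direction, heatmap regression, and in/out classification losses. The snippet below is a minimal PyTorch sketch of these two ideas under stated assumptions, not the authors' implementation: the function names (depth_proximity_cone, combined_loss), the Gaussian form of the proximity weighting, and the loss weights are all hypothetical.

    # Hypothetical sketch only: NOT the paper's implementation.
    # Shapes, names, and weightings are assumed for illustration.
    import torch
    import torch.nn.functional as F

    def depth_proximity_cone(depth, head_xy, gaze_dir, sigma=0.1, fov_cos=0.5):
        """Gaze cone over an H x W grid, attenuated by depth proximity.

        depth:    (H, W) normalized depth map
        head_xy:  (2,) head position in normalized image coordinates [0, 1]
        gaze_dir: (2,) unit 2D gaze direction
        """
        H, W = depth.shape
        ys, xs = torch.meshgrid(
            torch.linspace(0, 1, H), torch.linspace(0, 1, W), indexing="ij")
        # Angular term: cosine between the gaze direction and head -> pixel vectors.
        vec = torch.stack([xs - head_xy[0], ys - head_xy[1]], dim=-1)
        vec = vec / (vec.norm(dim=-1, keepdim=True) + 1e-6)
        cos = (vec * gaze_dir).sum(-1).clamp(min=0)
        cone = torch.where(cos > fov_cos, cos, torch.zeros_like(cos))
        # Proximity term: pixels whose depth is close to the person's depth are
        # emphasized; distant distractors are suppressed (assumed Gaussian form).
        person_depth = depth[int(head_xy[1] * (H - 1)), int(head_xy[0] * (W - 1))]
        proximity = torch.exp(-((depth - person_depth) ** 2) / (2 * sigma ** 2))
        return cone * proximity

    def combined_loss(pred_heatmap, gt_heatmap, pred_dir, gt_dir,
                      pred_inout, gt_inout, attn_logits, gt_grid_dist,
                      w_kl=1.0, w_dir=1.0, w_hm=1.0, w_io=1.0):
        """KL supervision of grid-level attention plus direction, heatmap, in/out terms."""
        l_kl = F.kl_div(F.log_softmax(attn_logits, dim=-1), gt_grid_dist,
                        reduction="batchmean")
        l_dir = 1.0 - F.cosine_similarity(pred_dir, gt_dir, dim=-1).mean()
        l_hm = F.mse_loss(pred_heatmap, gt_heatmap)
        l_io = F.binary_cross_entropy_with_logits(pred_inout, gt_inout)
        return w_kl * l_kl + w_dir * l_dir + w_hm * l_hm + w_io * l_io

In this reading, the proximity term suppresses pixels whose depth differs markedly from the target person's, which is one plausible way to realize the sharpened gaze cone described above, while the equal loss weights merely stand in for whatever balance the paper actually uses.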

    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications (Just Accepted)
    EISSN: 1551-6865

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Online AM: 26 August 2024
    Accepted: 13 August 2024
    Revised: 12 August 2024
    Received: 06 March 2024

    Author Tags

    1. Gaze following
    2. Gaze cone
    3. Depth information
    4. Attention
    5. Saliency
