Depth Matters: Spatial Proximity-based Gaze Cone Generation for Gaze Following in Wild

Abstract

Gaze following aims to predict where a person is looking in a scene. Existing methods tend to prioritize traditional 2D RGB visual cues, or they require burdensome prior knowledge and additional, expensive datasets annotated in 3D coordinate systems to train specialized modules for scene modeling. In this work, we introduce a novel framework built on a simple ResNet backbone that uses only images and depth maps to mimic human visual preferences and achieve 3D-like depth perception. We first leverage depth maps to encode spatial proximity between scene objects and the target person. This sharpens the focus of the gaze cone on the region of interest pertaining to the target while diminishing the impact of surrounding distractions. To capture how the saliency gaze cone depends on diverse scene context, we then introduce a learnable grid-level regularized attention that anticipates coarse-grained regions of interest, thereby refining the mapping of the saliency feature to pixel-level heatmaps. This allows our model to better account for individual differences when predicting others’ gaze locations. Finally, we employ a KL-divergence loss to supervise the grid-level regularized attention, combining it with the gaze direction, heatmap regression, and in/out classification losses to provide comprehensive supervision for model optimization. Experimental results on two publicly available datasets demonstrate that our model achieves comparable performance while relying on less modal information. Visualization results further validate the interpretability of our method. The source code will be available at https://github.com/VUT-HFUT/DepthMatters.
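
The abstract describes two components concretely enough to sketch: a gaze cone attenuated by depth-based spatial proximity between the target person and scene points, and a combined objective in which a KL-divergence term supervises the grid-level regularized attention alongside the gaze direction, heatmap regression, and in/out classification losses. The snippet below is a minimal PyTorch sketch of these two ideas under stated assumptions, not the authors' implementation: the function names (depth_proximity_cone, combined_loss), the Gaussian form of the proximity weighting, and the loss weights are all hypothetical.

    # Hypothetical sketch only: NOT the paper's implementation.
    # Shapes, names, and weightings are assumed for illustration.
    import torch
    import torch.nn.functional as F

    def depth_proximity_cone(depth, head_xy, gaze_dir, sigma=0.1, fov_cos=0.5):
        """Gaze cone over an H x W grid, attenuated by depth proximity.

        depth:    (H, W) normalized depth map
        head_xy:  (2,) head position in normalized image coordinates [0, 1]
        gaze_dir: (2,) unit 2D gaze direction
        """
        H, W = depth.shape
        ys, xs = torch.meshgrid(
            torch.linspace(0, 1, H), torch.linspace(0, 1, W), indexing="ij")
        # Angular term: cosine between the gaze direction and head -> pixel vectors.
        vec = torch.stack([xs - head_xy[0], ys - head_xy[1]], dim=-1)
        vec = vec / (vec.norm(dim=-1, keepdim=True) + 1e-6)
        cos = (vec * gaze_dir).sum(-1).clamp(min=0)
        cone = torch.where(cos > fov_cos, cos, torch.zeros_like(cos))
        # Proximity term: pixels whose depth is close to the person's depth are
        # emphasized; distant distractors are suppressed (assumed Gaussian form).
        person_depth = depth[int(head_xy[1] * (H - 1)), int(head_xy[0] * (W - 1))]
        proximity = torch.exp(-((depth - person_depth) ** 2) / (2 * sigma ** 2))
        return cone * proximity

    def combined_loss(pred_heatmap, gt_heatmap, pred_dir, gt_dir,
                      pred_inout, gt_inout, attn_logits, gt_grid_dist,
                      w_kl=1.0, w_dir=1.0, w_hm=1.0, w_io=1.0):
        """KL supervision of grid-level attention plus direction, heatmap, in/out terms."""
        l_kl = F.kl_div(F.log_softmax(attn_logits, dim=-1), gt_grid_dist,
                        reduction="batchmean")
        l_dir = 1.0 - F.cosine_similarity(pred_dir, gt_dir, dim=-1).mean()
        l_hm = F.mse_loss(pred_heatmap, gt_heatmap)
        l_io = F.binary_cross_entropy_with_logits(pred_inout, gt_inout)
        return w_kl * l_kl + w_dir * l_dir + w_hm * l_hm + w_io * l_io

In this reading, the proximity term suppresses pixels whose depth differs markedly from the target person's, which is one plausible way to realize the sharpened gaze cone described above, while the equal loss weights merely stand in for whatever balance the paper actually uses.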

    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications (Just Accepted)
    EISSN: 1551-6865

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Online AM: 26 August 2024
    Accepted: 13 August 2024
    Revised: 12 August 2024
    Received: 06 March 2024

    Author Tags

    1. Gaze following
    2. Gaze cone
    3. Depth information
    4. Attention
    5. Saliency
