Generative Adversarial Network for Future Hand Segmentation from Egocentric Video

Published: 23 October 2022

Abstract

We introduce the novel problem of anticipating a time series of future hand masks from egocentric video. A key challenge is to model the stochasticity of future head motions, which globally impact the analysis of head-worn camera video. To this end, we propose a novel deep generative model, EgoGAN. Our model first uses a 3D fully convolutional network to learn a spatio-temporal video representation for pixel-wise visual anticipation. It then generates future head motion with a Generative Adversarial Network (GAN) and predicts future hand masks based on both the encoded video representation and the generated future head motion. We evaluate our method on the EPIC-Kitchens and EGTEA Gaze+ datasets and conduct detailed ablation studies to validate our design choices. Furthermore, we compare our method with previous state-of-the-art methods on future image segmentation and provide extensive analysis showing that our method predicts future hand masks more accurately. Project page: https://vjwq.github.io/EgoGAN/.
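To make the architecture concrete, the sketch below outlines the three stages the abstract describes: a 3D fully convolutional encoder, a GAN generator that samples future head motion, and a decoder that fuses the two to predict future hand masks. This is a minimal illustrative PyTorch sketch; the module names, layer bodies, and tensor shapes are our assumptions, not the authors' released EgoGAN implementation, and the discriminator and training losses are omitted.

```python
# Minimal sketch of the three-stage pipeline described in the abstract.
# All module names, shapes, and layer choices are illustrative assumptions.
import torch
import torch.nn as nn

class Encoder3D(nn.Module):
    """3D fully convolutional backbone: learns a spatio-temporal
    representation of the observed egocentric clip."""
    def __init__(self, in_ch=3, feat_ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, feat_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(feat_ch, feat_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, clip):               # clip: (B, 3, T, H, W)
        return self.net(clip)              # (B, C, T, H, W)

class HeadMotionGenerator(nn.Module):
    """GAN generator: samples a plausible future head-motion field
    conditioned on the encoded video and a noise vector."""
    def __init__(self, feat_ch=64, z_dim=16):
        super().__init__()
        self.net = nn.Conv3d(feat_ch + z_dim, 2, kernel_size=3, padding=1)

    def forward(self, feat, z):
        # Broadcast the noise vector over the spatio-temporal grid.
        B, _, T, H, W = feat.shape
        z = z.view(B, -1, 1, 1, 1).expand(B, z.size(1), T, H, W)
        return self.net(torch.cat([feat, z], dim=1))   # (B, 2, T, H, W)

class HandMaskDecoder(nn.Module):
    """Predicts future hand masks from the video representation
    plus the generated future head motion."""
    def __init__(self, feat_ch=64):
        super().__init__()
        self.net = nn.Conv3d(feat_ch + 2, 1, kernel_size=3, padding=1)

    def forward(self, feat, motion):
        logits = self.net(torch.cat([feat, motion], dim=1))
        return torch.sigmoid(logits)       # per-pixel hand probability

# Forward pass on a dummy clip.
enc, gen, dec = Encoder3D(), HeadMotionGenerator(), HandMaskDecoder()
clip = torch.randn(2, 3, 8, 64, 64)       # batch of two 8-frame clips
feat = enc(clip)
motion = gen(feat, torch.randn(2, 16))
masks = dec(feat, motion)                  # (2, 1, 8, 64, 64) future hand masks
```

In the full method, the generator would be trained adversarially against a discriminator on observed future head motion, and the decoder with a pixel-wise segmentation loss; sampling different noise vectors then yields different plausible futures, which is how the model captures the stochasticity of head motion.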


Cited By

  • Spherical World-Locking for Audio-Visual Localization in Egocentric Videos. In: Computer Vision – ECCV 2024, pp. 256–274 (2024). https://doi.org/10.1007/978-3-031-72691-0_15
  • LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning. In: Computer Vision – ECCV 2024, pp. 135–155 (2024). https://doi.org/10.1007/978-3-031-72673-6_8


Published In

Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XIII
Oct 2022
803 pages
ISBN:978-3-031-19777-2
DOI:10.1007/978-3-031-19778-9

Publisher

Springer-Verlag

Berlin, Heidelberg


Author Tags

  1. Egocentric vision
  2. Hand segmentation
  3. Visual anticipation

Qualifiers

  • Article

