Generative Adversarial Network for Future Hand Segmentation from Egocentric Video

Published: 23 October 2022

Abstract

We introduce the novel problem of anticipating a time series of future hand masks from egocentric video. A key challenge is to model the stochasticity of future head motions, which globally impact the analysis of head-worn camera video. To this end, we propose a novel deep generative model, EgoGAN. Our model first uses a 3D fully convolutional network to learn a spatio-temporal video representation for pixel-wise visual anticipation. It then generates future head motion with a Generative Adversarial Network (GAN) and predicts future hand masks based on both the encoded video representation and the generated future head motion. We evaluate our method on the EPIC-Kitchens and EGTEA Gaze+ datasets and conduct detailed ablation studies to validate our design choices. Furthermore, we compare our method with previous state-of-the-art methods on future image segmentation and provide extensive analysis showing that our method predicts future hand masks more accurately. Project page: https://vjwq.github.io/EgoGAN/.
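To make the architecture concrete, the sketch below outlines the three stages the abstract describes: a 3D fully convolutional encoder, a GAN generator that samples future head motion, and a decoder that fuses the two to predict future hand masks. This is a minimal illustrative PyTorch sketch; the module names, layer bodies, and tensor shapes are our assumptions, not the authors' released EgoGAN implementation, and the discriminator and training losses are omitted.

```python
# Minimal sketch of the three-stage pipeline described in the abstract.
# All module names, shapes, and layer choices are illustrative assumptions.
import torch
import torch.nn as nn

class Encoder3D(nn.Module):
    """3D fully convolutional backbone: learns a spatio-temporal
    representation of the observed egocentric clip."""
    def __init__(self, in_ch=3, feat_ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, feat_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(feat_ch, feat_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, clip):               # clip: (B, 3, T, H, W)
        return self.net(clip)              # (B, C, T, H, W)

class HeadMotionGenerator(nn.Module):
    """GAN generator: samples a plausible future head-motion field
    conditioned on the encoded video and a noise vector."""
    def __init__(self, feat_ch=64, z_dim=16):
        super().__init__()
        self.net = nn.Conv3d(feat_ch + z_dim, 2, kernel_size=3, padding=1)

    def forward(self, feat, z):
        # Broadcast the noise vector over the spatio-temporal grid.
        B, _, T, H, W = feat.shape
        z = z.view(B, -1, 1, 1, 1).expand(B, z.size(1), T, H, W)
        return self.net(torch.cat([feat, z], dim=1))   # (B, 2, T, H, W)

class HandMaskDecoder(nn.Module):
    """Predicts future hand masks from the video representation
    plus the generated future head motion."""
    def __init__(self, feat_ch=64):
        super().__init__()
        self.net = nn.Conv3d(feat_ch + 2, 1, kernel_size=3, padding=1)

    def forward(self, feat, motion):
        logits = self.net(torch.cat([feat, motion], dim=1))
        return torch.sigmoid(logits)       # per-pixel hand probability

# Forward pass on a dummy clip.
enc, gen, dec = Encoder3D(), HeadMotionGenerator(), HandMaskDecoder()
clip = torch.randn(2, 3, 8, 64, 64)       # batch of two 8-frame clips
feat = enc(clip)
motion = gen(feat, torch.randn(2, 16))
masks = dec(feat, motion)                  # (2, 1, 8, 64, 64) future hand masks
```

In the full method, the generator would be trained adversarially against a discriminator on observed future head motion, and the decoder with a pixel-wise segmentation loss; sampling different noise vectors then yields different plausible futures, which is how the model captures the stochasticity of head motion.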


Cited By

  • Spherical World-Locking for Audio-Visual Localization in Egocentric Videos. In: Computer Vision – ECCV 2024, pp. 256–274 (2024). https://doi.org/10.1007/978-3-031-72691-0_15
  • LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning. In: Computer Vision – ECCV 2024, pp. 135–155 (2024). https://doi.org/10.1007/978-3-031-72673-6_8


Published In

Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XIII
Oct 2022
803 pages
ISBN:978-3-031-19777-2
DOI:10.1007/978-3-031-19778-9

Publisher

Springer-Verlag

Berlin, Heidelberg


Author Tags

  1. Egocentric vision
  2. Hand segmentation
  3. Visual anticipation

Qualifiers

  • Article

