
Hallucinating Pose-Compatible Scenes

Conference paper

Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13676)


Abstract

What does human pose tell us about a scene? We propose a task to answer this question: given human pose as input, hallucinate a compatible scene. Subtle cues captured by human pose—action semantics, environment affordances, object interactions—provide surprising insight into which scenes are compatible. We present a large-scale generative adversarial network for pose-conditioned scene generation. We significantly scale the size and complexity of training data, curating a massive meta-dataset containing over 19 million frames of humans in everyday environments. We double the capacity of our model with respect to StyleGAN2 to handle such complex data, and design a pose conditioning mechanism that drives our model to learn the nuanced relationship between pose and scene. We leverage our trained model for various applications: hallucinating pose-compatible scene(s) with or without humans, visualizing incompatible scenes and poses, placing a person from one generated image into another scene, and animating pose. Our model produces diverse samples and outperforms pose-conditioned StyleGAN2 and Pix2Pix/Pix2PixHD baselines in terms of accurate human placement (percent of correct keypoints) and quality (Fréchet inception distance).
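The page includes no code, but for readers who want a concrete picture of what "pose conditioning" can mean in practice, the sketch below rasterizes 2D keypoints into Gaussian heatmaps and encodes them into a vector that conditions a StyleGAN2-style mapping network. This is a minimal PyTorch illustration of the general pattern, not the authors' mechanism: the helper names (keypoints_to_heatmaps, PoseEncoder), the heatmap resolution, and all layer sizes are assumptions.

    # Minimal PyTorch sketch of pose conditioning for a StyleGAN2-class
    # generator. NOT the paper's implementation; every name and
    # hyperparameter here is an assumption for illustration only.
    import torch
    import torch.nn as nn

    def keypoints_to_heatmaps(kpts, size=64, sigma=2.0):
        """kpts: (B, K, 2) keypoints in [0, 1]^2 -> (B, K, size, size) Gaussians."""
        B, K, _ = kpts.shape
        ys = torch.linspace(0.0, 1.0, size).view(1, 1, size, 1)
        xs = torch.linspace(0.0, 1.0, size).view(1, 1, 1, size)
        kx = kpts[..., 0].view(B, K, 1, 1)
        ky = kpts[..., 1].view(B, K, 1, 1)
        d2 = (xs - kx) ** 2 + (ys - ky) ** 2          # squared distance to keypoint
        return torch.exp(-d2 / (2.0 * (sigma / size) ** 2))

    class PoseEncoder(nn.Module):
        """Encodes keypoint heatmaps into a conditioning vector that can be
        concatenated with the latent z before the mapping network -- one
        common recipe for conditional StyleGAN2 variants."""
        def __init__(self, num_kpts=17, dim=512):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(num_kpts, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(256, dim),
            )

        def forward(self, kpts):
            return self.net(keypoints_to_heatmaps(kpts))

    # Usage: condition the latent on pose before an (assumed) mapping MLP.
    z = torch.randn(4, 512)
    kpts = torch.rand(4, 17, 2)             # 17 COCO-style keypoints per person
    cond = PoseEncoder()(kpts)              # (4, 512) pose embedding
    w_input = torch.cat([z, cond], dim=1)   # input to a mapping network

Injecting pose at the input of the mapping network is just one design point among several (heatmaps could instead be concatenated to intermediate generator features); the abstract states only that the authors designed a purpose-built pose conditioning mechanism and doubled the generator's capacity relative to StyleGAN2.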


Notes

  1. For those unfamiliar with mime artists, here is a wonderful example performance: https://youtu.be/FPMBV3rd_hI.

  2. https://www.timothybrooks.com/tech/hallucinating-scenes.


Acknowledgements

We thank William Peebles, Ilija Radosavovic, Matthew Tancik, Allan Jabri, Dave Epstein, Lucy Chai, Toru Lin, Shiry Ginosar, Angjoo Kanazawa, Vickie Ye, Karttikeya Mangalam, and Taesung Park for insightful discussion and feedback. Tim Brooks is supported by the National Science Foundation Graduate Research Fellowship under Grant No. 2020306087. Additional support for this project is provided by DARPA MCS and SAP.

Author information

Corresponding author

Correspondence to Tim Brooks.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 20,042 KB)


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Brooks, T., Efros, A.A. (2022). Hallucinating Pose-Compatible Scenes. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13676. Springer, Cham. https://doi.org/10.1007/978-3-031-19787-1_29

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-19787-1_29

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19786-4

  • Online ISBN: 978-3-031-19787-1

  • eBook Packages: Computer Science, Computer Science (R0)
