DOI: 10.1007/978-3-031-72920-1_16

Fast View Synthesis of Casual Videos with Soup-of-Planes

Published: 01 October 2024

Abstract

Novel view synthesis from an in-the-wild video is difficult due to challenges like scene dynamics and lack of parallax. While existing methods have shown promising results with implicit neural radiance fields, they are slow to train and render. This paper revisits explicit video representations to synthesize high-quality novel views from a monocular video efficiently. We treat static and dynamic video content separately. Specifically, we build a global static scene model using an extended plane-based scene representation to synthesize temporally coherent novel video. Our plane-based scene representation is augmented with spherical harmonics and displacement maps to capture view-dependent effects and model non-planar complex surface geometries. We opt to represent the dynamic content as per-frame point clouds for efficiency. While such representations are inconsistency-prone, minor temporal inconsistencies are perceptually masked due to motion. We develop a method to quickly estimate such a hybrid video representation and render novel views in real time. Our experiments show that our method can render high-quality novel views from an in-the-wild video with comparable quality to state-of-the-art methods while being 100× faster in training and enabling real-time rendering. Project page at https://casual-fvs.github.io.
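To make the appearance model above concrete, the following NumPy snippet sketches how a plane texel carrying low-order spherical-harmonic (SH) coefficients can be turned into view-dependent color, in the spirit of the augmented plane-based representation the abstract describes. This is an illustrative assumption, not the paper's implementation: the helper names (sh_basis, shade), the degree-1 SH truncation, and the toy coefficient values are all choices made here for clarity.

```python
# Minimal sketch (NumPy only), NOT the authors' code: a plane texel stores
# degree-1 spherical-harmonic (SH) RGB coefficients, and shading evaluates
# them for a query view direction to produce view-dependent color.
import numpy as np

def sh_basis(dirs):
    """Real SH basis up to degree 1 (4 terms) for unit direction vectors."""
    x, y, z = dirs[..., 0], dirs[..., 1], dirs[..., 2]
    c0 = 0.28209479177387814   # Y_0^0
    c1 = 0.4886025119029199    # shared constant for Y_1^{-1}, Y_1^0, Y_1^1
    return np.stack([np.full_like(x, c0), c1 * y, c1 * z, c1 * x], axis=-1)

def shade(sh_coeffs, view_dirs):
    """sh_coeffs: (..., 4, 3) per-texel SH coefficients; view_dirs: (..., 3) unit vectors.
    Returns (..., 3) RGB."""
    basis = sh_basis(view_dirs)                              # (..., 4)
    return np.einsum('...k,...kc->...c', basis, sh_coeffs)   # (..., 3)

# Toy usage: a gray texel that brightens when viewed head-on along +z.
coeffs = np.zeros((4, 3))
coeffs[0] = 0.5    # DC (view-independent) term
coeffs[2] = 0.3    # z-dependent term
print(shade(coeffs, np.array([0.0, 0.0, 1.0])))   # brighter, frontal view
print(shade(coeffs, np.array([1.0, 0.0, 0.0])))   # dimmer, grazing view
```

In the paper's pipeline, as the abstract describes, such per-plane appearance parameters are estimated from the input video and combined with displacement maps for non-planar geometry, while per-frame point clouds supply the dynamic content at render time.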


Published In

Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part XXXVIII
Sep 2024
583 pages
ISBN: 978-3-031-72919-5
DOI: 10.1007/978-3-031-72920-1
Editors: Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, Gül Varol

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 01 October 2024

Author Tags

  1. Novel view synthesis
  2. casual video

Qualifiers

  • Article
