Novel View Synthesis from a Single Unposed Image via Unsupervised Learning

Published: 31 May 2023

Abstract

Novel view synthesis aims to generate novel views from one or more given source views. Although existing methods have achieved promising performance, they usually require paired views with different poses to learn a pixel transformation. This article proposes an unsupervised network that learns such a pixel transformation from a single source image. In particular, the network consists of a token transformation module, which transforms the features extracted from a source image into an intrinsic representation with respect to a pre-defined reference pose, and a view generation module, which synthesizes an arbitrary view from that representation. The learned transformation allows a novel view to be synthesized from any single source image of unknown pose. Experiments on widely used view synthesis datasets demonstrate that the proposed network produces results comparable to state-of-the-art methods, even though learning is unsupervised and only a single source image is required to generate a novel view. The code will be made available upon acceptance of the article.
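For orientation, the two-module design described above (a token transformation module that re-expresses source-image features with respect to a reference pose, followed by a view generation module conditioned on a target pose) can be summarized in the following PyTorch-style sketch. It is a minimal illustration only: the module internals, layer sizes, and the 6-D pose vector are assumptions and do not reflect the authors' implementation, which the abstract says will be released upon acceptance.

```python
# Minimal sketch, assuming a PyTorch implementation; not the authors' code.
# Layer sizes, the transformer-based token mixer, and the 6-D pose vector
# are illustrative assumptions derived only from the abstract's description.
import torch
import torch.nn as nn


class TokenTransformation(nn.Module):
    """Maps source-image features to an intrinsic representation defined
    with respect to a pre-defined reference pose (hypothetical internals)."""

    def __init__(self, feat_dim=256, nhead=8, num_layers=2):
        super().__init__()
        # Patchify the source image into feature tokens.
        self.tokenizer = nn.Conv2d(3, feat_dim, kernel_size=8, stride=8)
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=nhead,
                                           batch_first=True)
        self.to_reference = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, src_img):
        tokens = self.tokenizer(src_img).flatten(2).transpose(1, 2)  # (B, N, C)
        return self.to_reference(tokens)  # reference-pose representation


class ViewGeneration(nn.Module):
    """Synthesizes an arbitrary view from the intrinsic representation and a
    target pose (the pose encoding here is an assumption)."""

    def __init__(self, feat_dim=256, pose_dim=6, out_size=64):
        super().__init__()
        self.pose_fc = nn.Linear(pose_dim, feat_dim)
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim, 3 * out_size * out_size),
            nn.Sigmoid(),
        )
        self.out_size = out_size

    def forward(self, intrinsic_tokens, target_pose):
        # Fuse the pooled intrinsic representation with the requested pose.
        fused = intrinsic_tokens.mean(dim=1) + self.pose_fc(target_pose)
        img = self.decoder(fused)
        return img.view(-1, 3, self.out_size, self.out_size)


# Usage: one unposed source image in, a novel view at a chosen pose out.
src_img = torch.rand(1, 3, 64, 64)      # single source image of unknown pose
target_pose = torch.rand(1, 6)          # desired pose (illustrative encoding)
intrinsic = TokenTransformation()(src_img)
novel_view = ViewGeneration()(intrinsic, target_pose)
print(novel_view.shape)                  # torch.Size([1, 3, 64, 64])
```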


    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 19, Issue 6
    November 2023
    858 pages
    ISSN:1551-6857
    EISSN:1551-6865
    DOI:10.1145/3599695
    Editor: Abdulmotaleb El Saddik

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 31 May 2023
    Online AM: 11 March 2023
    Accepted: 26 February 2023
    Revised: 17 January 2023
    Received: 03 June 2022
    Published in TOMM Volume 19, Issue 6

    Author Tags

    1. Multimedia applications
    2. 3D display
    3. unsupervised single-view synthesis
    4. token transformation module
    5. view generation module

    Qualifiers

    • Research-article

    Funding Sources

    • National Key R&D Program of China
    • National Natural Science Foundation of China
    • Natural Science Foundation of Tianjin
    • China Postdoctoral Science Foundation

