Unsupervised Learning for Depth, Ego-Motion, and Optical Flow Estimation Using Coupled Consistency Conditions
Abstract
1. Introduction
1.1. Supervised Learning
1.2. Unsupervised Learning
1.3. The Contribution of This Work
2. Method
2.1. Overview of Method
2.2. Flow Consistency with Depth and Ego-Motion
2.3. Flow Local Consistency
2.4. View Synthesis in Stereo Video
2.5. Loss Function for Training
3. Experimental Results
3.1. Dataset
3.2. Training Details
3.3. Depth Estimation Results
3.4. Optical Flow Estimation Results
3.5. Camera Ego-Motion Estimation Results
4. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
- Fong, T.; Nourbakhsh, I.; Dautenhahn, K. A survey of socially interactive robots. Robot. Auton. Syst. 2003, 42, 143–166. [Google Scholar] [CrossRef]
- Fraundorfer, F.; Engels, C.; Nister, D. Topological mapping, localization and navigation using image collections. Int. Conf. Intell. Robot. Syst. 2007, 77, 3872–3877. [Google Scholar]
- Hirschmuller, H. Stereo processing by semiglobal matching and mutual information. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 328–341. [Google Scholar] [CrossRef] [PubMed]
- Hernandez, D.; Chacon, A.; Espinosa, A.; Vazquez, D.; Moure, J.; Lopez, A. Embedded real-time stereo estimation via semi-global matching on the GPU. Procedia Comput. Sci. 2016, 80, 143–153. [Google Scholar] [CrossRef]
- Newcombe, R.A.; Lovegrove, S.J.; Davison, A.J. DTAM: Dense tracking and mapping in real-time. In Proceedings of the 2011 International Conference on Computer Vision (ICCV 2011), Barcelona, Spain, 6–13 November 2011; pp. 2320–2327. [Google Scholar]
- Agrawal, P.; Carreira, J.; Malik, J. Learning to see by moving. In Proceedings of the 2015 International Conference on Computer Vision (ICCV 2015), Santiago, Chile, 7–13 December 2015; pp. 37–45. [Google Scholar]
- Eigen, D.; Puhrsch, C.; Fergus, R. Depth map prediction from a single image using a multi-scale deep network. In Proceedings of the 2014 Neural Information Processing Systems (NIPS 2014), Montreal, QC, Canada, 8–13 December 2014; pp. 2366–2374. [Google Scholar]
- Furukawa, Y.; Curless, B.; Seitz, S.M.; Szeliski, R. Towards internet-scale multi-view stereo. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2010), San Francisco, CA, USA, 13–18 June 2010; pp. 1434–1441. [Google Scholar]
- Triggs, B.; McLauchlan, P.F.; Hartley, R.I.; Fitzgibbon, A.W. Bundle adjustment—A modern synthesis. In International Workshop on Vision Algorithms; Springer: Berlin/Heidelberg, Germany, 2002; pp. 298–372. [Google Scholar]
- Kendall, A.; Grimes, M.; Cipolla, R. PoseNet: A convolutional network for real-time 6-DoF camera relocalization. In Proceedings of the 2015 International Conference on Computer Vision (ICCV 2015), Santiago, Chile, 7–13 December 2015; pp. 2938–2946. [Google Scholar]
- Wang, S.; Clark, R.; Wen, H.; Trigoni, N. DeepVO: Towards end-to-end visual odometry with deep recurrent convolutional neural networks. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA 2017), Singapore, 29 May–3 June 2017; pp. 2043–2050. [Google Scholar]
- Ladicky, L.; Zeisl, B.; Pollefeys, M. Discriminatively trained dense surface normal estimation. In Proceedings of the 13th European Conference on Computer Vision (ECCV 2014), Zurich, Switzerland, 6–12 September 2014; pp. 468–484. [Google Scholar]
- Wang, P.; Shen, X.; Lin, Z.; Cohen, S.; Price, B.L.; Yuille, A.L. Towards unified depth and semantic prediction from a single image. In Proceedings of the 28th IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2015), Boston, MA, USA, 7–12 June 2015; pp. 2800–2809. [Google Scholar]
- Garg, R.; BG, V.K.; Carneiro, G.; Reid, I. Unsupervised CNN for single view depth estimation. In Proceedings of the 14th European Conference on Computer Vision (ECCV 2016), Amsterdam, The Netherlands, 8–16 October 2016; pp. 740–756. [Google Scholar]
- Xie, J.; Girshick, R.B.; Farhadi, A. Deep3D: Fully automatic 2D-to-3D video conversion with deep convolutional neural networks. In Proceedings of the 14th European Conference on Computer Vision (ECCV 2016), Amsterdam, The Netherlands, 8–16 October 2016; pp. 842–857. [Google Scholar]
- Laina, I.; Rupprecht, C.; Belagiannis, V.; Tombari, F.; Navab, N. Deeper depth prediction with fully convolutional residual networks. In Proceedings of the 4th International Conference on 3D Vision (3DV 2016), Stanford, CA, USA, 25–28 October 2016; pp. 239–248. [Google Scholar]
- Kendall, A.; Martirosyan, H.; Dasgupta, S.; Henry, P.; Kennedy, R.; Bachrach, A.; Bry, A. End-to-end learning of geometry and context for deep stereo regression. In Proceedings of the 2017 International Conference on Computer Vision (ICCV 2017), Venice, Italy, 22–29 October 2017; pp. 66–75. [Google Scholar]
- Yu, J.J.; Harley, A.W.; Derpanis, K.G. Back to basics: Unsupervised learning of optical flow via brightness constancy and motion smoothness. In Proceedings of the 14th European Conference on Computer Vision (ECCV 2016), Amsterdam, The Netherlands, 8–16 October 2016; pp. 3–10. [Google Scholar]
- Vijayanarasimhan, S.; Ricco, S.; Schmid, C.; Sukthankar, R.; Fragkiadaki, K. SfM-Net: Learning of structure and motion from video. arXiv 2017, arXiv:1704.07804. [Google Scholar]
- Zhou, T.; Brown, M.; Snavely, N.; Lowe, D. Unsupervised learning of depth and ego-motion from video. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 21–26 July 2017; pp. 6612–6621. [Google Scholar]
- Godard, C.; Mac Aodha, O.; Brostow, G.J. Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 21–26 July 2017; pp. 270–279. [Google Scholar]
- Wang, Y.; Xu, Y. Unsupervised learning of accurate camera pose and depth from video sequences with Kalman filter. IEEE Access 2019, 7, 32796–32804. [Google Scholar] [CrossRef]
- Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2012), Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar]
- Flynn, J.; Neulander, I.; Philbin, J.; Snavely, N. Deep stereo: Learning to predict new views from the world’s imagery. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, 27–30 June 2016; pp. 5515–5524. [Google Scholar]
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 2672–2680. [Google Scholar]
- Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV 2017), Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar]
- Jaderberg, M.; Simonyan, K.; Zisserman, A.; Kavukcuoglu, K. Spatial transformer networks. Adv. Neural Inf. Process. Syst. 2015, 28, 2017–2025. [Google Scholar]
- Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
- Meister, S.; Hur, J.; Roth, S. UnFlow: Unsupervised learning of optical flow with a bidirectional census loss. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; pp. 1–9. [Google Scholar]
- Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; et al. Automatic differentiation in PyTorch. In Proceedings of the 2017 International Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 1–4. [Google Scholar]
- Kingma, D.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA, 7–9 May 2015; pp. 1–14. [Google Scholar]
- Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv 2015, arXiv:1502.03167. [Google Scholar]
- Nair, V.; Hinton, G.E. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), Haifa, Israel, 21–24 June 2010; pp. 807–814. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Dosovitskiy, A.; Fischer, P.; Ilg, E.; Hausser, P.; Hazirbas, C.; Golkov, V.; van der Smagt, P.; Cremers, D.; Brox, T. FlowNet: Learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV 2015), Santiago, Chile, 7–13 December 2015; pp. 2758–2766. [Google Scholar]
- Ilg, E.; Mayer, N.; Saikia, T.; Keuper, M.; Dosovitskiy, A.; Brox, T. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 21–26 July 2017; pp. 1647–1655. [Google Scholar]
- Mayer, N.; Ilg, E.; Hausser, P.; Fischer, P.; Cremers, D.; Dosovitskiy, A.; Brox, T. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, 27–30 June 2016; pp. 4040–4048. [Google Scholar]
- Ren, J.; Yan, J.; Ni, B.; Liu, B.; Yang, X.; Zha, H. Unsupervised deep learning for optical flow estimation. In Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI 2017), San Francisco, CA, USA, 4–9 February 2017; pp. 1495–1501. [Google Scholar]
- Brox, T.; Bruhn, A.; Papenberg, N.; Weickert, J. High accuracy optical flow estimation based on a theory for warping. In Proceedings of the 8th European Conference on Computer Vision (ECCV 2004), Prague, Czech Republic, 11–14 May 2004; pp. 25–36. [Google Scholar]
- Mur-Artal, R.; Tardós, J.D.; Montiel, J.M.M.; Gálvez-López, D. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Trans. Robot. 2015, 31, 1147–1163. [Google Scholar] [CrossRef]
- Yin, Z.; Shi, J. GeoNet: Unsupervised learning of dense depth, optical flow and camera pose. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 18–22 June 2018; pp. 1983–1992. [Google Scholar]
Method | Dataset | Abs Rel | Sq Rel | RMSE | RMSE log | δ < 1.25 | δ < 1.25² | δ < 1.25³
---|---|---|---|---|---|---|---|---
Zhou et al. [20] | K (U) | 0.208 | 1.768 | 6.856 | 0.283 | 0.678 | 0.885 | 0.957
Eigen et al. [7] Coarse | K (D) | 0.214 | 1.605 | 6.563 | 0.292 | 0.673 | 0.884 | 0.957
Eigen et al. [7] Fine | K (D) | 0.203 | 1.548 | 6.307 | 0.282 | 0.702 | 0.890 | 0.958
Godard et al. [21] | K (P) | 0.148 | 1.344 | 5.972 | 0.247 | 0.803 | 0.922 | 0.964
Garg et al. [14] | K (P) | 0.169 | 1.080 | 5.104 | 0.273 | 0.704 | 0.904 | 0.962
Wang et al. [22] | K (U) | 0.154 | 1.333 | 5.996 | 0.251 | 0.782 | 0.916 | 0.963
Ours (w/o depth smooth) | K (U) | 0.183 | 1.442 | 5.289 | 0.264 | 0.686 | 0.891 | 0.955
Ours (w/o synt. cons.) | K (U) | 0.171 | 1.597 | 5.337 | 0.252 | 0.692 | 0.898 | 0.955
Ours (w/o flow cons.) | K (U) | 0.158 | 1.514 | 5.293 | 0.271 | 0.694 | 0.888 | 0.951
Ours | K (U) | 0.143 | 1.328 | 5.102 | 0.244 | 0.803 | 0.930 | 0.960
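The depth table uses the standard monocular-depth evaluation metrics (error columns: lower is better; δ accuracy columns: higher is better). As a minimal sketch of how these numbers are computed (function and variable names are ours, not from the paper, and masking of invalid ground-truth pixels is assumed to have been done already):

```python
import numpy as np

def depth_metrics(gt, pred):
    """Standard depth-evaluation metrics over valid ground-truth pixels.

    gt, pred: arrays of positive depths (meters), same shape.
    Returns (abs_rel, sq_rel, rmse, rmse_log, a1, a2, a3), where a1..a3
    are the fractions of pixels with max(gt/pred, pred/gt) below
    1.25, 1.25**2, and 1.25**3.
    """
    gt = np.asarray(gt, dtype=float).ravel()
    pred = np.asarray(pred, dtype=float).ravel()

    # Accuracy under threshold: ratio between prediction and truth.
    thresh = np.maximum(gt / pred, pred / gt)
    a1 = float((thresh < 1.25).mean())
    a2 = float((thresh < 1.25 ** 2).mean())
    a3 = float((thresh < 1.25 ** 3).mean())

    # Error metrics, as in the table columns.
    abs_rel = float(np.mean(np.abs(gt - pred) / gt))
    sq_rel = float(np.mean((gt - pred) ** 2 / gt))
    rmse = float(np.sqrt(np.mean((gt - pred) ** 2)))
    rmse_log = float(np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2)))
    return abs_rel, sq_rel, rmse, rmse_log, a1, a2, a3
```

A perfect prediction yields zero for every error column and 1.0 for every δ column, which is the sanity check usually run before evaluating on KITTI.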
Method | Dataset | 2012 EPE (Train) | 2012 EPE (Test) | 2012 Non-Oc | 2015 EPE (Train) | 2015 AEE (Train) | 2015 Fl (Train) | 2015 Non-Oc | GPU Runtime (ms)
---|---|---|---|---|---|---|---|---|---
FlowNetC [35] | C (G) | 9.35 | - | 7.23 | 12.52 | - | 47.93% | 9.35 | 51.4
FlowNetS [35] | C (G) | 8.26 | - | 6.85 | 15.44 | - | 52.86% | 8.12 | 20.2
DSTFlow [38] | K (U) | 10.43 | 12.4 | - | 16.79 | 14.61 | 36.00% | - | -
FlowNet2 [36] | C (G) + T (G) | 4.09 | - | 3.42 | 10.06 | 9.17 | 30.37% | 4.93 | 101.6
Ours (w/o flow cons.) | K (U) | 5.33 | 4.72 | 6.21 | 11.28 | 10.11 | 35.42% | 5.13 | 38.1
Ours (w/o synt. cons.) | K (U) | 5.02 | 4.55 | 6.69 | 10.74 | 9.58 | 34.17% | 5.08 | 38.2
Ours (w/o FLC) | K (U) | 4.41 | 4.38 | 5.45 | 10.33 | 9.37 | 31.64% | 4.96 | 40.4
Ours | K (U) | 4.16 | 4.07 | 4.52 | 10.05 | 9.12 | 29.78% | 4.81 | 42.8
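In the flow table, EPE (endpoint error) is the mean Euclidean distance between predicted and ground-truth flow vectors, and Fl is KITTI's outlier percentage: pixels whose error exceeds both 3 px and 5% of the ground-truth flow magnitude. A hedged sketch of both metrics (names are ours; per-benchmark masking details are omitted):

```python
import numpy as np

def flow_epe(gt, pred):
    """Mean endpoint error between two H x W x 2 flow fields."""
    err = np.linalg.norm(gt - pred, axis=-1)  # per-pixel Euclidean error
    return float(err.mean())

def flow_outlier_ratio(gt, pred, abs_thresh=3.0, rel_thresh=0.05):
    """KITTI-style Fl metric: fraction of pixels whose endpoint error
    exceeds both abs_thresh pixels and rel_thresh of the ground-truth
    flow magnitude."""
    err = np.linalg.norm(gt - pred, axis=-1)
    mag = np.linalg.norm(gt, axis=-1)
    outliers = (err > abs_thresh) & (err > rel_thresh * mag)
    return float(outliers.mean())
```

The "Non-Oc" columns report the same endpoint error restricted to non-occluded pixels, i.e. the mean is taken only over the benchmark's non-occlusion mask.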
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Share and Cite
Mun, J.-H.; Jeon, M.; Lee, B.-G. Unsupervised Learning for Depth, Ego-Motion, and Optical Flow Estimation Using Coupled Consistency Conditions. Sensors 2019, 19, 2459. https://doi.org/10.3390/s19112459