Abstract
This paper presents novel techniques for recovering 3D dense scene flow, based on differential analysis of 4D light fields. The key enabling result is a per-ray linear equation, called the ray flow equation, that relates 3D scene flow to 4D light field gradients. The ray flow equation is invariant to 3D scene structure and applicable to a general class of scenes, but is under-constrained (3 unknowns per equation). Thus, additional constraints must be imposed to recover motion. We develop two families of scene flow algorithms by leveraging the structural similarity between ray flow and optical flow equations: local ‘Lucas–Kanade’ ray flow and global ‘Horn–Schunck’ ray flow, inspired by corresponding optical flow methods. We also develop a combined local–global method by utilizing the correspondence structure in the light fields. We demonstrate high-precision 3D scene flow recovery for a wide range of scenarios, including rotation and non-rigid motion. We analyze the theoretical and practical performance limits of the proposed techniques via the light field structure tensor, a \(3 \times 3\) matrix that encodes the local structure of light fields. We envision that the proposed analysis and algorithms will lead to the design of future light-field cameras that are optimized for motion sensing, in addition to depth sensing.
Notes
Structure tensors have been studied and defined in different ways in the light field community (e.g., Neumann et al. 2004). Here the structure tensor is defined via the light field gradients w.r.t. the 3D motion and is thus a \(3\times 3\) matrix.
Although the structure tensor theoretically has rank 2, the ratio \(\frac{\lambda _1}{\lambda _2}\) of the largest and second largest eigenvalues can be large. This is because the eigenvalue corresponding to Z motion depends on the range of (u, v) coordinates, which is limited by the size of the light field window. Therefore, a sufficiently large window size is required for motion recovery.
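To make this concrete, the following minimal sketch (illustrative values only; the window geometry and gradient values are assumptions, not data from the paper) builds the structure tensor \(\mathbf {S}=\mathbf {A}^T\mathbf {A}\) for a step-edge window and examines its eigenvalues:

```python
import numpy as np

def structure_tensor_eigs(A):
    """Eigenvalues of the 3x3 light field structure tensor S = A^T A,
    where each row of A holds the gradients (L_X, L_Y, L_Z) of one ray."""
    S = A.T @ A                          # 3x3 structure tensor
    return np.linalg.eigvalsh(S)[::-1]   # descending: lam1 >= lam2 >= lam3

# Illustrative window on a vertical step edge: L_Y = 0 everywhere and
# L_Z is tied to L_X through the u coordinate (cf. Appendix A, Case 2).
u = np.linspace(-0.05, 0.05, 9)          # assumed (small) angular range
L_X = np.ones_like(u)
A = np.stack([L_X, np.zeros_like(u), u * L_X], axis=1)
lam1, lam2, lam3 = structure_tensor_eigs(A)
print(lam1 / lam2, lam3)                 # large ratio, lam3 = 0 (rank 2)
```

Enlarging the range of u directly increases the second eigenvalue, which is why a larger light field window improves the conditioning of Z-motion recovery.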
References
Adelson, E. H., & Wang, J. Y. A. (1992). Single lens stereo with a plenoptic camera. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 14(2), 99–106.
Alexander, E., Guo, Q., Koppal, S., Gortler, S., & Zickler, T. (2016). Focal flow: Measuring distance and velocity with defocus and differential motion. In European conference on computer vision (ECCV) (pp. 667–682). Heidelberg: Springer.
Aujol, J. F., Gilboa, G., Chan, T., & Osher, S. (2006). Structure-texture image decomposition-modeling, algorithms, and parameter selection. International Journal of Computer Vision (IJCV), 67(1), 111–136. https://doi.org/10.1007/s11263-006-4331-z.
Black, M. J., & Anandan, P. (1996). The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields. Computer Vision and Image Understanding, 63(1), 75–104.
Bok, Y., Jeon, H. G., & Kweon, I. S. (2017). Geometric calibration of micro-lens-based light field cameras using line features. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 39(2), 287–300.
Brox, T., Bruhn, A., Papenberg, N., & Weickert, J. (2004). High accuracy optical flow estimation based on a theory for warping. In European conference on computer vision (ECCV) (pp. 25–36).
Bruhn, A., Weickert, J., & Schnörr, C. (2005). Lucas/Kanade meets Horn/Schunck: Combining local and global optic flow methods. International Journal of Computer Vision (IJCV), 61(3), 211–231.
Chandraker, M. (2014a). On shape and material recovery from motion. In European conference on computer vision (ECCV) (pp. 202–217). Heidelberg: Springer.
Chandraker, M. (2014b). What camera motion reveals about shape with unknown BRDF. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2171–2178). Washington: IEEE.
Chandraker, M. (2016). The information available to a moving observer on shape with unknown, isotropic BRDFs. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 38(7), 1283–1297.
Dansereau, D. G., Mahon, I., Pizarro, O., & Williams, S. B. (2011). Plenoptic flow: Closed-form visual odometry for light field cameras. In IEEE/RSJ international conference on intelligent robots and systems (IROS) (pp. 4455–4462). Washington: IEEE.
Dansereau, D. G., Schuster, G., Ford, J., & Wetzstein, G. (2017). A wide-field-of-view monocentric light field camera. In IEEE conference on computer vision and pattern recognition (CVPR). Washington: IEEE.
Gottfried, J. M., Fehr, J., & Garbe, C. S. (2011). Computing range flow from multi-modal kinect data. In International symposium on visual computing (pp. 758–767). Heidelberg: Springer.
Hasinoff, S. W., Durand, F., & Freeman, W. T. (2010). Noise-optimal capture for high dynamic range photography. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 553–560). Washington: IEEE.
Haussecker, H. W., & Fleet, D. J. (2001). Computing optical flow with physical models of brightness variation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6), 661–673.
Heber, S., & Pock, T. (2014). Scene flow estimation from light fields via the preconditioned primal–dual algorithm. In X. Jiang, J. Hornegger, & R. Koch (Eds.), Pattern recognition (pp. 3–14). Cham: Springer International.
Horn, B. K., & Schunck, B. G. (1981). Determining optical flow. Artificial Intelligence, 17(1–3), 185–203.
Hung, C. H., Xu, L., & Jia, J. (2013). Consistent binocular depth and scene flow with chained temporal profiles. International Journal of Computer Vision (IJCV), 102(1–3), 271–292.
Jaimez, M., Souiai, M., Gonzalez-Jimenez, J., & Cremers, D. (2015). A primal–dual framework for real-time dense RGB-D scene flow. In IEEE international conference on robotics and automation (ICRA) (pp. 98–104). Washington: IEEE.
Jo, K., Gupta, M., & Nayar, S. K. (2015). SpeDo: 6 DOF ego-motion sensor using speckle defocus imaging. In IEEE international conference on computer vision (ICCV) (pp. 4319–4327). Washington: IEEE.
Johannsen, O., Sulc, A., & Goldluecke, B. (2015). On linear structure from motion for light field cameras. In IEEE international conference on computer vision (ICCV) (pp. 720–728). Washington: IEEE.
Letouzey, A., Petit, B., & Boyer, E. (2011). Scene flow from depth and color images. In British machine vision conference (BMVC) (pp. 46–56). BMVA Press.
Levoy, M., & Hanrahan, P. (1996). Light field rendering. In SIGGRAPH conference on computer graphics and interactive techniques (pp. 31–42). New York: ACM.
Li, Z., Xu, Z., Ramamoorthi, R., & Chandraker, M. (2017). Robust energy minimization for BRDF-invariant shape from light fields. In IEEE conference on computer vision and pattern recognition (CVPR) (Vol. 1). Washington: IEEE.
Lucas, B. D., & Kanade, T. (1981). An iterative image registration technique with an application to stereo vision. In International joint conference on artificial intelligence (pp. 674–679). San Francisco: Morgan Kaufmann.
Ma, S., Smith, B. M., & Gupta, M. (2018). 3D scene flow from 4D light field gradients. In European conference on computer vision (ECCV) (Vol. 8, pp. 681–698). Springer International Publishing.
Navarro, J., & Garamendi, J. (2016). Variational scene flow and occlusion detection from a light field sequence. In International conference on systems, signals and image processing (IWSSIP) (pp. 1–4). Washington: IEEE.
Neumann, J., Fermuller, C., & Aloimonos, Y. (2003). Polydioptric camera design and 3D motion estimation. In IEEE conference on computer vision and pattern recognition (CVPR) (Vol. 2, pp. II–294). Washington: IEEE.
Neumann, J., Fermüller, C., & Aloimonos, Y. (2004). A hierarchy of cameras for 3D photography. Computer Vision and Image Understanding, 96(3), 274–293.
Ng, R., Levoy, M., Brédif, M., Duval, G., Horowitz, M., & Hanrahan, P. (2005). Light field photography with a hand-held plenoptic camera. Computer Science Technical Report CSTR, 2(11), 1–11.
Odobez, J. M., & Bouthemy, P. (1995). Robust multiresolution estimation of parametric motion models. Journal of Visual Communication and Image Representation, 6(4), 348–365.
Phong, B. T. (1975). Illumination for computer generated pictures. Communications of the ACM, 18(6), 311–317. https://doi.org/10.1145/360825.360839.
Ranjan, A., Jampani, V., Balles, L., Kim, K., Sun, D., Wulff, J., et al. (2018). Competitive collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. arXiv:1805.09806 [cs]
Shi, J., & Tomasi, C. (1994). Good features to track. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 593–600). Washington: IEEE.
Smith, B., O’Toole, M., & Gupta, M. (2018). Tracking multiple objects outside the line of sight using speckle imaging. In IEEE conference on computer vision and pattern recognition (CVPR). Washington: IEEE.
Smith, B. M., Desai, P., Agarwal, V., & Gupta, M. (2017). CoLux: Multi-object 3d micro-motion analysis using speckle imaging. ACM Transactions on Graphics, 36(4), 1–12.
Srinivasan, P. P., Tao, M. W., Ng, R., & Ramamoorthi, R. (2015). Oriented light-field windows for scene flow. In IEEE international conference on computer vision (ICCV) (pp. 3496–3504). Washington: IEEE.
Sun, D., Roth, S., & Black, M. J. (2010). Secrets of optical flow estimation and their principles. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2432–2439). Washington: IEEE.
Sun, D., Sudderth, E. B., & Pfister, H. (2015). Layered RGBD scene flow estimation. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 548–556). Washington: IEEE.
Tao, M. W., Hadap, S., Malik, J., & Ramamoorthi, R. (2013). Depth from combining defocus and correspondence using light-field cameras. In IEEE international conference on computer vision (ICCV) (pp. 673–680). Washington: IEEE.
Vedula, S., Baker, S., Rander, P., Collins, R., & Kanade, T. (1999). Three-dimensional scene flow. In IEEE international conference on computer vision (ICCV) (Vol. 2, pp. 722–729). Washington: IEEE.
Vijayanarasimhan, S., Ricco, S., Schmid, C., Sukthankar, R., & Fragkiadaki, K. (2017). SfM-Net: Learning of structure and motion from video. arXiv:1704.07804 [cs]
Wang, T. C., Chandraker, M., Efros, A. A., & Ramamoorthi, R. (2016). SVBRDF-invariant shape and reflectance estimation from light-field cameras. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 5451–5459). Washington: IEEE.
Wanner, S., & Goldluecke, B. (2014). Variational light field analysis for disparity estimation and super-resolution. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 36(3), 606–619.
Wedel, A., Rabe, C., Vaudrey, T., Brox, T., Franke, U., & Cremers, D. (2008). Efficient dense scene flow from sparse or dense stereo data. In European conference on computer vision (ECCV) (pp. 739–751). Heidelberg: Springer.
Yin, Z., & Shi, J. (2018). GeoNet: Unsupervised learning of dense depth, optical flow and camera pose. In IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 1983–1992). Salt Lake City, UT: IEEE. https://doi.org/10.1109/CVPR.2018.00212.
Zhang, Y., Li, Z., Yang, W., Yu, P., Lin, H., & Yu, J. (2017). The light field 3D scanner. In IEEE international conference on computational photography (ICCP) (pp. 1–9). Washington: IEEE.
Additional information
Communicated by Yair Weiss.
The authors would like to thank the Office of Naval Research (ONR Grant No. N00014-16-1-2995) and the Defense Advanced Research Projects Agency (DARPA REVEAL program) for funding this research.
Electronic supplementary material
Below is the link to the electronic supplementary material.
Supplementary material 1 (mp4 23017 KB)
Appendices
Appendix A: Proof of Result 2
Result 2 (Rank of structure tensor) Structure tensor \(\mathbf {S}\) has three possible ranks: 0, 2, and 3 for a local 4D light field window. These correspond to scene patches with no texture (smooth regions), an edge, and 2D texture, respectively.
Proof
We first show the three cases of rank 0, 2 and 3, and then we prove that the structure tensor cannot be rank 1.
Since \(rank(\mathbf {S})=rank(\mathbf {A}^T\mathbf {A})=rank(\mathbf {A})\), we only need to consider the rank of the \(n\times 3\) matrix \(\mathbf {A}\), whose rows are the light field gradients \((L_X, L_Y, L_Z)\) of the \(n\) rays in the window, and whose three columns we denote \(\mathbf {A}_1, \mathbf {A}_2, \mathbf {A}_3\).
Case 1: Smooth region. In this case, \(L_X = L_Y = L_Z = 0\) at all locations in the light field window. Therefore, all entries of \(\mathbf {A}\) are zero, resulting in a rank 0 structure tensor. All three eigenvalues are zero (\(\lambda _1 = \lambda _2 = \lambda _3 = 0\)). As a result, \(\mathbf {S}\) has a 3D null space, and no motion vector can be recovered reliably.
Case 2: Single step edge. Without loss of generality, suppose the light field window corresponds to a fronto-parallel scene patch with a vertical edge, i.e., \(L_Y = 0\) everywhere and thus \(\mathbf {A}_2=\mathbf {0}\).
Consider a point P on the edge, and two rays from P captured in two horizontally separated sub-aperture images indexed by \((x_a,y)\) and \((x_b,y)\). Let the coordinates of the two rays be \((x_a,y,u_a,v_a)\) and \((x_b,y,u_b,v_b)\), and let the light field gradients at these two rays be \((L_{Xa},L_{Ya},L_{Za})\) and \((L_{Xb},L_{Yb},L_{Zb})\); these gradients form two rows of \(\mathbf {A}\). Recall that
\(L_Z = u\,L_X + v\,L_Y. \quad (17)\)
Since \(L_{Ya}=0\) and \(L_{Yb}=0\), we have
\(L_{Za} = u_a\,L_{Xa}, \qquad L_{Zb} = u_b\,L_{Xb}. \quad (18)\)
Next, suppose there exists \(k\ne 0\) such that \(\mathbf {A}_3=k\mathbf {A}_1\). This implies
\(u_a\,L_{Xa} = k\,L_{Xa}, \qquad u_b\,L_{Xb} = k\,L_{Xb}. \quad (19)\)
Since \(L_{Xa}\) and \(L_{Xb}\) are nonzero at the edge, eliminating k from Eq. 19 gives \(u_a=u_b\). However, since the scene point has a finite depth, the disparity \(u_a - u_b\) is nonzero. This contradiction means no such k exists, so \(\mathbf {A}_1\) and \(\mathbf {A}_3\) are linearly independent, and the rank of \(\mathbf {A}\) (and hence \(\mathbf {S}\)) is 2. As a result, \(\mathbf {S}\) has a 1D null space (only one eigenvalue \(\lambda _3 = 0\)), and a 2D family of motions (motion orthogonal to the edge) can be recovered.
Case 3: 2D texture. In general, \(\mathbf {A}_1\), \(\mathbf {A}_2\) and \(\mathbf {A}_3\) are nonzero and linearly independent. The structure tensor is full rank (rank \(=3\)), and the entire space of 3D motions is recoverable.
Now we show that the rank cannot be 1.
(Proof by contradiction.) Assume there exists a 4D patch whose corresponding matrix \(\mathbf {A}\) has rank 1.
First, \(\mathbf {A}_1\) and \(\mathbf {A}_2\) cannot both be zero: if they were, then by Eq. 17 all entries of \(\mathbf {A}_3\) would also be zero, resulting in a rank 0 matrix. Therefore \(\mathbf {A}_1\ne \mathbf {0}\) or \(\mathbf {A}_2\ne \mathbf {0}\).
Without loss of generality, assume \(\mathbf {A}_1 \ne \mathbf {0}\). Since \(\mathbf {A}\) has rank 1, there exist \(k,l\in \mathbb {R}\) such that
\(\mathbf {A}_2=k\,\mathbf {A}_1, \quad (20)\)
\(\mathbf {A}_3=l\,\mathbf {A}_1. \quad (21)\)
Let us pick a ray \(\mathbf {x_a}=(x_a,y_a,u_a,v_a)\) with light field gradient \((L_{Xa},L_{Ya},L_{Za})\) such that \(L_{Xa}\ne 0\); such a ray exists because \(\mathbf {A}_1\ne \mathbf {0}\). This ray is captured by the sub-aperture image indexed by \((x_a,y_a)\). Assume the scene point corresponding to \(\mathbf {x_a}\) is also observed in another sub-aperture image \((x_b,y_b)\) with \(y_b=y_a\), i.e., a sub-aperture image on the same horizontal line as \((x_a,y_a)\). Denote the corresponding ray by \(\mathbf {x_b}=(x_b,y_a,u_b,v_b)\), with light field gradient \((L_{Xb},L_{Yb},L_{Zb})\); \(L_{Xb}\) is also nonzero.
From Eq. 20 we know that \(L_{Ya}=kL_{Xa}\) and \(L_{Yb}=kL_{Xb}\). According to Eq. 17 we have
\(L_{Za} = u_a\,L_{Xa} + v_a\,L_{Ya} = (u_a + k v_a)\,L_{Xa}, \quad (22)\)
\(L_{Zb} = u_b\,L_{Xb} + v_b\,L_{Yb} = (u_b + k v_b)\,L_{Xb}. \quad (23)\)
From Eq. 21 we know that \(L_{Za}=l L_{Xa}\) and \(L_{Zb}=l L_{Xb}\); combining this with Eqs. 22–23 and dividing by the nonzero \(L_{Xa}\) and \(L_{Xb}\), we have
\(u_a + k v_a = u_b + k v_b. \quad (24)\)
However, since \(x_a\ne x_b\) and \(y_a=y_b\), simple epipolar geometry gives \(u_a\ne u_b\) and \(v_a=v_b\), so Eq. 24 cannot hold. This contradicts our assumption, and therefore \(rank(\mathbf {A})\) cannot be 1. \(\square \)
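The three rank cases can also be verified numerically. The sketch below is a toy check that assumes the relation of Eq. 17 between \(L_Z\) and \((L_X, L_Y)\) (the sign convention does not affect the rank) and uses hypothetical gradient values:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64
# Hypothetical (u, v) coordinates of the n rays in a local 4D window.
u, v = rng.uniform(-0.1, 0.1, n), rng.uniform(-0.1, 0.1, n)

def rank_of(LX, LY):
    LZ = u * LX + v * LY                   # Eq. 17 ties L_Z to (u, v)
    A = np.stack([LX, LY, LZ], axis=1)
    return np.linalg.matrix_rank(A)

zeros = np.zeros(n)
print(rank_of(zeros, zeros))               # smooth region -> rank 0
print(rank_of(rng.normal(size=n), zeros))  # vertical step edge -> rank 2
print(rank_of(rng.normal(size=n), rng.normal(size=n)))  # 2D texture -> rank 3
```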
Appendix B: Implementation Details
B.1 Global Method
In Sect. 5, we introduced the global ‘Horn–Schunck’ ray flow method, which solves for the 3D scene motion by minimizing the functional
\(E(\mathbf {V}) = \int \big ( L_X V_X + L_Y V_Y + L_Z V_Z + L_t \big )^2 + \lambda \big ( \Vert \nabla V_X \Vert ^2 + \Vert \nabla V_Y \Vert ^2 + \Vert \nabla V_Z \Vert ^2 \big )\, d\mathbf {x},\)
where \(\nabla \) denotes the gradient w.r.t. the \((u,v)\) coordinates and the integral is taken over the light field.
This is a convex functional, and its minimum can be found via the Euler–Lagrange equations
\(L_X\,\delta - \lambda \Delta V_X = 0, \quad L_Y\,\delta - \lambda \Delta V_Y = 0, \quad L_Z\,\delta - \lambda \Delta V_Z = 0,\)
where \(\delta = L_X V_X + L_Y V_Y + L_Z V_Z + L_t\) and \(\Delta \) is the Laplacian w.r.t. \((u,v)\).
These equations are discretized into a sparse linear system and solved using Successive Over-Relaxation (SOR).
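For concreteness, the following sketch performs one SOR sweep over the discretized Euler–Lagrange equations, using an assumed nearest-neighbor Laplacian stencil and relaxation factor; it is a minimal illustration rather than the actual implementation:

```python
import numpy as np

def sor_sweep(V, L, lam=0.1, omega=1.9):
    """One SOR sweep for the Horn-Schunck-style ray flow equations.
    V: (H, W, 3) motion field; L: (H, W, 4) gradients (L_X, L_Y, L_Z, L_t).
    At each site, the Euler-Lagrange equation with a nearest-neighbor
    Laplacian gives the 3x3 system
        (g g^T + k*lam*I) V = lam * sum(V_n) - g * L_t,
    where g = (L_X, L_Y, L_Z) and k is the number of neighbors."""
    H, W, _ = V.shape
    I3 = np.eye(3)
    for i in range(H):
        for j in range(W):
            nbrs = [V[p, q] for p, q in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                    if 0 <= p < H and 0 <= q < W]
            g, Lt = L[i, j, :3], L[i, j, 3]
            rhs = lam * np.sum(nbrs, axis=0) - g * Lt
            Vnew = np.linalg.solve(np.outer(g, g) + len(nbrs) * lam * I3, rhs)
            V[i, j] = (1 - omega) * V[i, j] + omega * Vnew  # over-relaxation
    return V
```

Since V is updated in place, each sweep is a Gauss–Seidel pass accelerated by the relaxation factor omega.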
B.2 Structure-Aware Global Method
In this section we discuss the structure-aware global method, an enhanced version of the global method that adopts the enhancement techniques for the local and global methods discussed in Sect. 6 of the main paper.
B.2.1 Data Term
The data term is defined as
\(E_D(\mathbf {V}) = \sum _{\mathbf {x}} \sum _{\mathbf {x_i}\in \mathscr {P}(u,v)} h_i\, \rho _D\Big ( \big ( L(\omega (\mathbf {x_i},\mathbf {V}), t+1) - L(\mathbf {x_i},t) \big )^2 \Big ),\)
where \(\mathscr {P}(u,v)\) is the 2D plane defined in Equation (11) in the main paper and \(\omega \) denotes the warp function (Eq. 8) in the main paper.
Weighted 2D window. Each ray \(\mathbf {x_i}\) in the 2D plane is given a different weight
\(h_i = h_g(\mathbf {x_i}) \cdot h_o(\mathbf {x_i}),\)
where \(\mathbf {x_c}\) denotes the center ray of the window and \(d_\alpha =1/\alpha \) is proportional to the actual depth of the scene point.
\(h_g\) is a Gaussian weight based on the distance between \(\mathbf {x_i}\) and \(\mathbf {x_c}\) in the 2D plane. \(h_o\) is an occlusion weight that penalizes the difference between the estimated disparities \(\alpha \) at \(\mathbf {x_i}\) and \(\mathbf {x_c}\). Notice that, because of occlusions, not all rays on \(\mathscr {P}(u,v)\) correspond to the same scene point as \(\mathbf {x}_c\): if the scene point corresponding to \(\mathbf {x}_i\) occludes, or is occluded by, the scene point corresponding to \(\mathbf {x}_c\), the two rays have different \(\alpha \) and thus \(h_o\) is small.
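The sketch below illustrates one plausible form of these weights. The paper fixes only their qualitative behavior, so the Gaussian forms and the bandwidths sigma_g and sigma_o are assumptions:

```python
import numpy as np

def ray_weights(p, p_c, d_alpha, d_alpha_c, sigma_g=1.0, sigma_o=0.02):
    """Per-ray weights h_i = h_g * h_o for rays on the plane P(u, v).
    p: (n, 2) in-plane coordinates of the rays; p_c: (2,) center ray x_c;
    d_alpha, d_alpha_c: depth proxies d_alpha = 1/alpha per ray and at x_c.
    h_g falls off with distance from x_c; h_o penalizes depth disagreement
    (occlusion).  Both Gaussian forms are assumed, not from the paper."""
    h_g = np.exp(-np.sum((p - p_c) ** 2, axis=1) / (2 * sigma_g ** 2))
    h_o = np.exp(-((d_alpha - d_alpha_c) ** 2) / (2 * sigma_o ** 2))
    return h_g * h_o
```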
B.2.2 Smoothness Term
The smoothness term is defined as
\(E_S(\mathbf {V}) = \sum _{\mathbf {x}} g(\mathbf {x}) \sum _{i=1}^{2} \rho _S\big ( V_{X(i)}^2 + V_{Y(i)}^2 + V_{Z(i)}^2 \big ),\)
where \(V_{X(i)}\) is short for \(\frac{\partial V_X}{\partial u^{(i)}}\) (for simplicity we denote u, v as \(u^{(1)},u^{(2)}\), respectively), and \(g(\mathbf {x})\) is a weight function that varies across the light field. The error term \(E_C(\mathbf {V})\) uses the warp function (Eq. 8) in the main paper.
Best practices from optical flow. We choose the penalty functions \(\rho _D\) and \(\rho _S\) to be the generalized Charbonnier penalty \(\rho (x^2)=(x^2+\epsilon ^2)^a\) with \(a=0.45\), as suggested in Sun et al. (2010).
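In code, the penalty and the derivative \(\rho '\) required in B.2.3 are one-liners (\(\epsilon \) is an assumed small constant):

```python
def rho(x2, a=0.45, eps=1e-3):
    """Generalized Charbonnier penalty rho(x^2) = (x^2 + eps^2)^a."""
    return (x2 + eps ** 2) ** a

def rho_prime(x2, a=0.45, eps=1e-3):
    """Derivative w.r.t. x^2; this is the rho_D' factor in B.2.3."""
    return a * (x2 + eps ** 2) ** (a - 1)
```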
Weight function for the regularization term. The weight function \(g(\mathbf {x})\) consists of two parts, a motion-consistency weight \(g_m(\mathbf {x})\) and a depth-consistency weight \(g_d(\mathbf {x})\), which are combined using a harmonic mean:
\(g(\mathbf {x}) = \frac{2\, g_m(\mathbf {x})\, g_d(\mathbf {x})}{g_m(\mathbf {x}) + g_d(\mathbf {x})}.\)
Consistency between XY-motion and Z-motion. In practice we notice that motion discontinuities are preserved better in the XY-motion than in the Z-motion. To improve the accuracy of the Z-motion, we solve for the 3D motion \(\mathbf {V}\) in a two-step process. In the first pass we compute an initial estimate of the XY-motion, denoted \(\mathbf {U}=(U_X,U_Y)\). We then use \(\mathbf {U}\) to compute the motion-consistency weight \(g_m(\mathbf {x})\) from the gradient magnitude of \(\mathbf {U}\) over a local neighborhood \(\mathscr {N}(\mathbf {x})\) of each point \(\mathbf {x}\), and compute the full 3D motion \(\mathbf {V}\) in a second pass. Notice that \(g_m(\mathbf {x})\) is small where the gradient of \(\mathbf {U}\) is large; in other words, the regularization term contributes less to the total energy where there is a discontinuity in \(\mathbf {U}\).
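A sketch of this weight map is shown below; the exponential mapping, the neighborhood radius, and the constant kappa are illustrative assumptions, since only the qualitative behavior (low weight near discontinuities of \(\mathbf {U}\)) is specified:

```python
import numpy as np

def motion_weight_map(U, radius=2, kappa=100.0):
    """Motion-consistency weight g_m(x): small where the first-pass
    XY-motion U (shape (H, W, 2)) varies strongly within the
    neighborhood N(x) of half-width `radius`."""
    gy0, gx0 = np.gradient(U[..., 0])
    gy1, gx1 = np.gradient(U[..., 1])
    mag2 = gx0**2 + gy0**2 + gx1**2 + gy1**2   # |grad U|^2 per pixel
    H, W = mag2.shape
    g = np.empty_like(mag2)
    for i in range(H):
        for j in range(W):
            patch = mag2[max(i - radius, 0):i + radius + 1,
                         max(j - radius, 0):j + radius + 1]
            g[i, j] = np.exp(-kappa * patch.max())  # low near motion edges
    return g
```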
Consistency between motion boundaries and depth boundaries. We also assume that motion boundaries are likely to align with depth boundaries. In other words, the depth-consistency weight \(g_d(\mathbf {x})\) is lower at points where the depth gradient is large.
B.2.3 Optimization
The error term \(E_D(\mathbf {V})\) can be linearized as
\(E_D'(\mathbf {V}) = \sum _{\mathbf {x}} \sum _{\mathbf {x_i}\in \mathscr {P}(u,v)} h_i\, \rho _D\big ( (L_{Xi}V_X + L_{Yi}V_Y + L_{Zi}V_Z + L_{ti})^2 \big ).\)
Then the entire energy \(E'=E_D'+E_S\) can be minimized using the Euler–Lagrange equations, in which \(\rho _D'\) is short for \(\rho _D'((L_XV_X+L_YV_Y+L_ZV_Z+L_t)^2)\) and \(\delta _L=L_XV_X+L_YV_Y+L_ZV_Z\). Again, these equations are discretized and solved using SOR. The linearization step can then be repeated in an iterative, multi-resolution framework.
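The overall scheme can be summarized by the following sketch of the coarse-to-fine loop. The pyramid handling, the nearest-neighbor upsampling, and the helper solve_linearized (one linearize-and-SOR solve returning a motion increment) are hypothetical stand-ins for the actual implementation:

```python
import numpy as np

def upsample(V, shape):
    """Nearest-neighbor upsampling of the motion field.  No rescaling of
    the values is needed: V is metric 3D motion, not pixel displacement."""
    H, W, _ = shape
    h, w, _ = V.shape
    ii = (np.arange(H) * h // H).clip(max=h - 1)
    jj = (np.arange(W) * w // W).clip(max=w - 1)
    return V[np.ix_(ii, jj)]

def coarse_to_fine(L_pyramid, solve_linearized, n_warps=3):
    """L_pyramid lists the light field data from coarsest to finest level.
    At each level the energy is re-linearized n_warps times around the
    current estimate, as described in B.2.3."""
    V = None
    for L in L_pyramid:
        shape = L.shape[:2] + (3,)
        V = np.zeros(shape) if V is None else upsample(V, shape)
        for _ in range(n_warps):
            V = V + solve_linearized(L, V)   # warp, linearize, SOR solve
    return V
```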
About this article
Cite this article
Ma, S., Smith, B.M. & Gupta, M. Differential Scene Flow from Light Field Gradients. Int J Comput Vis 128, 679–697 (2020). https://doi.org/10.1007/s11263-019-01230-z