Sixth Indian Conference on Computer Vision, Graphics & Image Processing
Monocular Depth by Nonlinear Diffusion

Mariella Dimiccoli†, Jean-Michel Morel‡ and Philippe Salembier†

† Technical University of Catalonia
Dept. of Signal Theory and Com.
Jordi Girona 1-3, Barcelona, Spain
mariella,philippe@gps.tsc.upc.edu

‡ Superior Normal School of Cachan
Dept. of Applied Mathematics
Pr. Wilson 61, Cachan, France
morel@cmla.ens-cachan.fr

Abstract

Following the phenomenological approach of gestaltists, sparse monocular depth cues such as T- and X-junctions and local convexity are crucial to identify the shape and depth relationships of depicted objects. According to Kanizsa, mechanisms called amodal and modal completion transform these local relative depth cues into a global depth reconstruction. In this paper, we propose a mathematical and computational translation of gestalt depth perception theory, from the detection of local depth cues to their synthesis into a consistent global depth perception. The detection of local depth cues is built on the response of a line segment detector (LSD), which works in linear time relative to the image size without any parameter tuning. The depth synthesis process is based on a nonlinear iterative filter which is asymptotically equivalent to the Perona-Malik partial differential equation (PDE). Experimental results are shown on several real images and demonstrate that this simple approach can account for a variety of phenomena such as visual completion, transparency and self-occlusion.

1. Introduction

To infer the shape and the distance from the viewpoint of depicted objects, our visual system is influenced by several factors, commonly referred to as pictorial depth cues because of their use by artists to convey a greater sense of depth in a flat medium. The whole issue of how these factors are grouped together by the visual system to convey a unique, stable depth perception is what Kanizsa [14] called the more general "enigma of perception". Gestalt theory was a first scientific attempt to address this fundamental issue. Gestaltists consider human perception as the result of a construction process driven by a set of elementary grouping laws. These laws are supposed to act for every new percept before any high level cognitive process. In the founding Wertheimer paper [32], one can distinguish two kinds of grouping laws. The first kind are elementary grouping laws that start from the atomic local level to recursively construct larger and larger groups (gestalts). The second kind are principles governing the interaction, collaborative or conflictive, between partial gestalts obtained by elementary grouping laws. In a broad overview of Gestalt theory, Metzger [20] showed that depth can be perceived in the absence of binocular correspondence. Although these results were well known at the time computer vision emerged as a new discipline, a great deal of effort has been invested by the computer vision community in coming up with algorithms to recover depth from stereo [18] and from other cues that require multiple images, such as structure from motion [12] or depth from defocus [23]. More recently, several works on monocular depth perception have focused on learning approaches that capture contextual information [25, 27, 13] and still involve more neurophysiology than phenomenology. To the best of our knowledge, the laws governing the primary process of depth perception, as opposed to a more cognitive secondary process, have still not received an adequate mathematical and computational translation. This lack is mainly due to the qualitative nature of phenomenology. The mathematical definition of a digital image was ignored by Gestaltists, and the related issues of blur and noise in image formation were not even qualitatively considered. In this paper we attempt a mathematical and computational translation of gestalt laws and principles governing the monocular perception of depth, from the detection of sparse monocular depth cues such as T- and X-junctions and local convexity to their synthesis into a global depth reconstruction.

In the next section we survey the literature related to the subject. In Section 3 we give a detailed description of the proposed approach to monocular depth perception. In Section 4 we discuss the experimental results, and finally Section 5 reports the main conclusions of the present work.

978-0-7695-3476-3/08 $25.00 © 2008 IEEE
DOI 10.1109/ICVGIP.2008.97
2. Related work

The first relevant works on monocular depth perception appeared at the beginning of the nineties and presented solutions based on two different perspectives: the contour-processing and the region-processing perspective. Due to the crucial role of depth perception in the interpretation of illusory contours, most of these seminal works were developed by psychologists and conceived as computational models of illusory contours.

From the contour-processing perspective, the formation of a global percept from local cues has been modeled as an optimization process with a contour interpretation mechanism. Williams [33] described the occlusion mechanism by a set of integer linear constraints. These constraints ensure the physical consistency of a contour grouping process with the image evidence. The main limitation of this work is that it foregoes purely local use of local evidence. Saund [26] proposed a solution to this problem based on a token-based algorithmic framework allowing locally derived constraints to propagate globally around a junction graph. The junction label assignment is conducted through annealing-style optimization, which is well known to be susceptible to local optima. Taking a neurophysiological perspective, Heitger et al. [11] proposed a grouping method which consists in convolving a representation of occlusion cues with a set of orientation-selective kernels and nonlinear pairing operations. This method cannot resolve ambiguities and tends to complete the background as well.

From the region-processing perspective, the formation of a global percept from local cues has been modeled as an optimization process with a surface diffusion mechanism. Nitzberg and Mumford [22] proposed a variational formulation presented as a variant of the Mumford and Shah segmentation model [5], allowing regions to overlap. They first compute edges and T-junctions and then minimize the functional combinatorially with respect to all possible ways of connecting the T-junctions by new edges that are consistent with a given ordering hypothesis. This work has inspired more recent theoretical investigations, addressing the main issues of the numerical minimization of the functional [6] and the computational complexity [30]. Madarasmi et al. [17] proposed a Bayesian formulation: assuming that all surfaces in the scene are piecewise constant or frontoparallel, the problem of finding a piecewise smooth segmentation of the image into surfaces is equivalent to the problem of assigning a discrete depth value to each image pixel. Stella et al. [29] extended Madarasmi's work by embedding into a hierarchical MRF explicit decision rules that assert continuity of depth assignment values along contours and within surfaces, and discontinuity of depth assignment values across contours. A linear diffusion formulation has also been proposed by Geiger et al. [9]. First, a set of local surface interpretations is assigned to local occlusion cues, such as junctions and corners, in the form of salient surface-states. Then, a linear diffusion algorithm that blocks diffusion at intensity edges is applied. The best image organization is selected based on a coherence measure between pairs of junctions.

A more neurophysiological approach is taken by Kogo et al. [16] and Mordohai et al. [21]. [16] proposed a feedback model based on a surface completion scheme. The relative depths are determined by convolution with Gaussian-derivative-based filters, while an anisotropic diffusion equation [24] reconstructs the surfaces. [21] integrated first and second order information under a tensor voting framework for automatic junction labeling and selection between modal and amodal completion. Recently, Gao et al. [8] proposed a Bayesian inference framework which unifies the contour-based and the region-based perspectives. T-junctions are computed on atomic regions and broken into terminators. A graph representation is obtained consisting of two types of nodes, atomic regions and their corresponding terminators, which makes the problem a mixed MRF. The most recent works are learning-based approaches [25, 27, 13]. They are based on the use of a large database of images annotated with human-marked ground truth to learn local figure/ground labels [25], the set of parameters capturing the 3D location and orientation of small patches [27], or models of occlusion [13] based on both 2D and 3D depth cues. The inference is performed on an MRF [27] or on a conditional MRF [25, 13] to enforce global consistency.

Most of the described approaches have been tested only on a limited set of synthetic images [33, 26, 9, 17, 16, 21, 30], or on images previously segmented by interactive methods [29, 8, 25]. Impressive results on real images have indeed been shown when using learning-based approaches [25, 13, 27]. However, they are obtained using a ratio between the number of test images and the number of training images, with manual assignment of the ground truth, almost equal to 1.

In the next section, we propose a fully automatic method completely based on gestalt phenomenology that can account for a variety of phenomena on real images.
3. Proposed Approach
The method proposed here involves three main steps. The first step detects a set of monocular depth cues arising from elementary grouping laws. The second step encodes all local and non-local depth relationships, acting in an additive fashion under non-conflictive conditions (collaboration) and in an exclusive fashion under conflictive conditions (masking). The last step operates a synthesis of all available depth information to infer the shape and spatial layout of depicted objects.
Figure 1. (a) Occlusion. (b) Transparency.

Figure 2. Possible configurations of segments conveying the perception of X-junctions ((a),(c),(e)) and T-junctions ((b),(d)).

3.1. Computing Monocular Depth Cues

In this work we focus on a subset of monocular depth cues that do not require any a priori information about the scene and should be regarded as a direct, immediate response to retinal stimulation. For each cue, we detail the psychophysical description as well as its mathematical and computational translation.

Probably the most important monocular depth cue is occlusion. Occlusion occurs when an opaque object partly obscures the view of another object further away from the viewpoint (Fig.1(a)). In this case, the projection of the object contours partially hiding each other creates T-shaped junctions in the image plane. The geometrical configuration of T-junctions encodes relative depth information of the objects in partial occlusion: the stem of the T belongs to the partially occluded object and the roof to the occluding object. A particular case of occlusion is transparency, which occurs when the occluding object is transparent and therefore the more distant objects are visible through the less distant transparent one (Fig.1(b)). In this case, the projection of object contours creates X-shaped junctions in the image plane. Whereas the geometric characterization of T-junctions alone provides a local signature of occlusion, in the case of transparency a photometric characterization is also needed. At points where transparency occurs, two distinct depths lie in the same line of sight. The process of separating a single luminance value into two contributions is known as scission. Metelli [19] derived two constraints on the photometric conditions required for perceptual scission. The first constraint is known as the magnitude constraint: a transparent medium cannot increase the contrast of the visible structures. As a consequence, a region can scissor only if its contrast is less than or equal to the contrast of its flanking regions. The second constraint is known as the polarity constraint: a transparent medium cannot alter the contrast polarity of structures visible through it. Polarity constraints provide a photometric signature of transparency. Once scission has been identified, the problem of assigning surface properties correctly to the two depths is solved by using the magnitude constraint: the contrast between the regions belonging to the transparent medium is always lower than the contrast between the regions of the underlying object.

From the above description it follows that the figural signatures of occlusion and transparency are respectively T-junctions and X-junctions. Our method for detecting T-junctions and X-junctions is built on the response of the LSD proposed by Grompone et al. [10]. The perception of segments is related to the grouping law of constancy of direction (alignment), which is a special case of continuity of direction. LSD puts together two well known state-of-the-art algorithms for segment detection: the Burns segment detector [3] and the meaningful segment detector developed by Desolneux et al. [4]. First the image is segmented into line-support regions using the Burns strategy and the mean orientation is accurately computed for each support region. Then, following the approach of Desolneux et al. [4], segments are computed as outliers of an unstructured background model. The main advantage of this strategy is that the thresholds of the detection algorithm can be defined in order to control its expected number of false detections under the background model. In addition, the use of a preliminary line-support-region detection step speeds up the computation, leading to a line segment detector able to process images in linear time relative to the number of pixels. Furthermore, LSD leads to an easy visualization of T- and X-junctions, even though the junction center is often missed by the detection. In these cases, the visualization of junctions is the result of an interpolation process driven by the good continuation principle. Straight lines are extended and junctions are detected as intersections of straight lines. According to the number and the orientation of intersecting segments, junction points are classified. T-junctions can be detected as the intersection of two or three segments (Fig.2(b) and (d)) whereas X-junctions can be detected as the intersection of two, three, or even four segments (Fig.2(a),(c),(e)). The intersection of two segments may lead to a T-junction or an X-junction depending on the position of the intersection point P with respect to the tips Ei of the segments. When all tips have sufficient distance from P, they convey the perception of an X-junction (Fig.2(a)), otherwise of a T-junction (Fig.2(b)). The intersection at a point P of three segments s1, s2, and s3 such that two of them, say s1 and s2, are aligned may lead to a T-junction or an X-junction depending on the position of P with respect to the tips E5 and E6 of the third segment s3. When both tips of s3 have sufficient distance from P, they convey the perception of an X-junction (Fig.2(c)), otherwise of a T-junction (Fig.2(d)). The intersection of four segments at a point P leads to an X-junction when the segments are two by two aligned and all segment tips have sufficient distance from P (Fig.2(e)).

Figure 3. The polarity constraint tells us that s is the contour of the transparent object, since the polarity of the contrast between pairs of adjacent regions delimited by r ((A, B) and (C, D)) does not change when s is crossed.

While occlusion is detected simply by using the figural characterization of T-junctions, the detection of transparency involves a photometric characterization as well, since it requires checking the polarity constraint. Let A, B, C, and D be the four regions delimited by the contours r and s forming the X-junction and a square window of size w centered at the junction center (see Fig.3). The gray level representative of each region, a, b, c and d, is obtained as a median value on each region. If the regions A and B are separated by r and A and C are separated by s, then the polarity constraint is satisfied if the difference a − c has the same sign as the difference b − d or if the difference a − b has the same sign as the difference c − d. In the latter case, s is the contour of the transparent object and r is the contour of the underlying object. In the former case the contrary is true.

Figure 4. Convexity: contrast polarity and texture property being equal in ((a) and (b)) and in ((c) and (d)), the region with convex contour tends to be perceived as foreground.

In the absence of occlusion and transparency, the factors that determine which regions are perceived as foreground and which as background, given the complete description of the boundary contours, must be related to the shape of the regions and not to their contrast polarity or any other texture property. With respect to other global shape properties, convexity has proved to have a stronger influence on figural organization. Its role has been illustrated by Kanizsa: any convex curve (even if not closed) suggests itself as the boundary of a convex body in the foreground (Fig.4). From a mathematical point of view, the convexity of a curve is related to the sign of its curvature. Let u : R² → R be an image, Du the gradient of u and x a point of u such that Du(x) ≠ 0 and, in a neighborhood of x, the iso-level set of u through x is a C² Jordan arc Γ. Then the curvature vector κ(u) at x is defined by

    κ(u)(x) = −curv(u)(x) Du(x)/|Du(x)|    (1)

where curv(u)(x) is the curvature of u at point x. If x is a point of Γ, then the curvature vector κ(u)(x) is normal to Γ at x, as is the gradient vector Du/|Du|, and points towards the center of the osculating circle.

A first example of conflict between elementary grouping laws arises in correspondence of T- and X-junctions. In fact, at these points the local interpretation of relative depth conveyed by convexity is never in agreement with the interpretation conveyed by occlusion or by transparency. This situation is called conflict by gestaltists and is resolved by masking or, in more neurophysiological terms, inhibition. As in any other case of conflict, the grouping law that gives the better global explanation of the figure inhibits the competing one. At T- and X-junction points, the masking phenomenon implies the inhibition of convexity.

Occlusion is one of two ways by which observation conditions lead to object obscuration. The second one is camouflage. In camouflage the occluding object is rendered invisible by matching the color or the texture of the background. In both cases the visual system interpolates missing data, a process known as visual completion. This process is important because it is one of the means by which the visual system organizes its depth measurements into meaningful bodies. In the case of occlusion, the perceptual completion of partially occluded objects is referred to as amodal completion. In the case of camouflage, the perceptual completion of occluding objects is referred to as modal completion (see Fig.7). In general, the regions of the image that are visible and lead to visual completion are referred to as "inducers" [7]. Inducers of visual completion are pairs of T-junctions that, when connected by extrapolating one stem and connecting it with the stem of the other element of the pair, obey the "good continuation" law. This means that the interpolated curve should be as similar as possible to the pieces of curve it interpolates. According to Kellman and Shipley's theory of relatability [15], human vision does not always complete contours in the presence of T-junctions but uses geometric relationships among them to reduce the number of interpretations that are consistent with a given image. These geometric relationships are synthesized under the concept of relatability. The definition of relatability is as follows (see Fig.5(a)). Two edges are said to be relatable if the process of interpolation begins and ends at the points of tangent discontinuity of the contour, called T-junctions, and their linear extensions meet in their extended regions, forming an outer angle Φ less than π/2. Psychophysical data suggest that within the category of relatable edges there are quantitative variations in strength. As can be observed in Fig.5(b), the strength of the perceived connection decreases when the angle between two edges increases (1, 2, 3) and/or the offset between two parallel edges increases (4, 5, 6). We use these quantitative variations to choose the best relatable T-junction for a given T-junction, when multiple candidate pairs satisfying the relatability conditions are possible. The angle is used as the first criterion and, in case of angle parity, the offset between the two candidate edges is considered.

Figure 5. (a) Relatability geometry. (b) Strength variations of relatability.

Figure 6. Amodal completion: pairs of relatable regions (A1,A2), (B1,B2), and (C1,C2) have similar gray level.

As an additional constraint for relatability we also impose a photometric condition. Let ai, bi and ci be respectively the median gray levels of the regions Ai, Bi and Ci delimited by the contours forming the T-junction and a square window of size w centered at the junction center (see Fig.6). The relatability condition is checked only at pairs of T-junctions such that the median gray level of the region forming the top, say a1, and those of the regions forming the stem, b1 and c1, are similar respectively to a2, b2 and c2 or to a2, c2 and b2. In the case of camouflage, T-junctions show up as line ends that correspond to the stem of the T. When the occluding object matches the color of only one of the two background objects, pairs of T-junctions that lead to modal completion show up as pairs of corners (see Fig.7). We shall call the angles that lead to modal completion degenerate T-junctions. Pairs of degenerate T-junctions are detected using the quantitative variations of relatability. This criterion also allows one to decide which of the two segments forming the corner is the stem of the T-junction. For instance, in Fig.7 the application of this criterion leads to seeing the triangle behind the square.

Figure 7. (a) Modal completion: modal contours through a homogeneous zone. (b) Object boundaries visualized using LSD: T-junctions show up as corners.

Figure 8. FSPs and BSPs arising from: (a) T-junctions, (b) transparency, (c) convexity.
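The polarity test described above (comparing the sign of a − b with c − d, and of a − c with b − d) can be sketched as follows. The rule follows the text; the function name, the argument convention and the tie-breaking order are our own assumptions, not the paper's implementation.

```python
import numpy as np

def polarity_constraint(A, B, C, D):
    """Check the polarity constraint at an X-junction.
    A, B, C, D are sequences of gray levels sampled in the four regions
    around the junction (A and B separated by r; A and C separated by s).
    Returns (satisfied, contour_of_transparent_object)."""
    a, b, c, d = (float(np.median(R)) for R in (A, B, C, D))
    if np.sign(a - b) == np.sign(c - d):
        # Polarity of the contrast across r is preserved when s is
        # crossed: s is the contour of the transparent object.
        return True, "s"
    if np.sign(a - c) == np.sign(b - d):
        # Symmetric case: r is the contour of the transparent object.
        return True, "r"
    return False, None  # no photometric signature of transparency

# Example: crossing s lowers the contrast (magnitude constraint) but
# keeps its polarity, so s is flagged as the transparent contour.
ok, contour = polarity_constraint(A=[200], B=[100], C=[150], D=[120])
```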
3.2. Computing Initial Depth Values

Let z be the depth image. We call source points the points for which the initial depth gradient Dz0 is not zero and normal points the points for which Dz0 = 0. Source points arise in correspondence of depth cues. In the following we shall call foreground source points (FSPs) all source points marking the regions that are closer to the viewpoint and background source points (BSPs) the points more distant from the viewpoint. We assign a positive value to FSPs and zero to BSPs. The rest of the image is initialized with value zero. The way source points are computed depends on the type of depth cue. Source points arising from a T-junction at point P are computed as follows (see Fig.8(a)). Let s be the line containing the segment that forms the stem of the T-junction. Let m1 and m2 be the points belonging to s and having distance d from P. If m2 is the point lying on the stem and r1 the line perpendicular to s and passing through m2, then m1 is the FSP and the points m3 and m4 belonging to r1 and having distance d from m2 are the BSPs. Source points arising from convexity at a point P of a curve are computed in the following way (see Fig.8(c)). Let r be the line passing through P and having the direction of the gradient at P. Let m1 and m2 be the points belonging to r and having distance d from P. If m1 is the point lying on the half-line having origin in P and oriented as the curvature vector at P, then m1 is the BSP and m2 the FSP. Source points arising from transparency at point P are computed as follows (see Fig.8(b)). Let s be the line containing the contour of the transparent object, and m1 and m2 be the points belonging to s and having distance d from P. Let r1 be the line perpendicular to s and passing through m1, and r2 the line perpendicular to s and passing through m2. Let m3 and m4 be the points belonging to r1 and having distance d from m1, and m5 and m6 the points belonging to r2 and having distance d from m2. If the gray level difference between m4 and m6 is larger than the gray level difference between m3 and m5, then m3 and m5 are the FSPs whereas m4 and m6 are the BSPs. The distance d is at least 4 pixels to take into account image blur. It allows one to jump over edges.
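As an illustration, the T-junction rule above can be sketched in a few lines. The coordinate conventions and the helper name are our assumptions, not code from the paper:

```python
import numpy as np

def tjunction_source_points(P, stem_dir, d=4.0):
    """Sketch of the T-junction rule of Sec. 3.2: P is the junction
    point and stem_dir a vector along the stem line s, pointing from P
    into the stem. Returns the FSP and the two BSPs at distance d."""
    P = np.asarray(P, dtype=float)
    s = np.asarray(stem_dir, dtype=float)
    s = s / np.linalg.norm(s)        # unit vector along the stem line
    m2 = P + d * s                   # point m2 lying on the stem
    m1 = P - d * s                   # m1, opposite side: the FSP
    r1 = np.array([-s[1], s[0]])     # direction perpendicular to s
    m3 = m2 + d * r1                 # BSPs on the line through m2,
    m4 = m2 - d * r1                 # perpendicular to s
    return m1, (m3, m4)

# Junction at (10, 10) with a vertical stem: the FSP lies on the roof
# side, the two BSPs straddle the stem at distance d from m2.
fsp, bsps = tjunction_source_points(P=(10.0, 10.0), stem_dir=(0.0, 1.0))
```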
3.3. Depth Diffusion

Once source points have been computed, our goal is to extrapolate relative depth values to the entire image domain. To this goal we use a neighborhood filter. A neighborhood filter is any filter which averages the values of pixels which are close in gray level. The underlying assumption is that pixels belonging to the same object have a similar gray level. The average is commonly computed on pixels belonging to a spatial neighborhood, as in the Yaroslavsky neighborhood filter (YNF) [34], the SUSAN filter [28] and the bilateral filter [31], or in a fully non-local way as in the non-local means [2]. Let u be an image defined on a bounded domain Ω ⊂ R². The YNF computes a weighted average that can be written in a continuous form as

    YNF_{h,ρ} u(x) = (1/C(x)) ∫_{B_ρ(x)} u(y) e^{−|u(x)−u(y)|²/h²} dy    (2)

where B_ρ(x) is the ball with radius ρ and center x, x ∈ Ω, and C(x) = ∫_{B_ρ(x)} e^{−|u(x)−u(y)|²/h²} dy is the normalization factor. Neighborhood filters have been proved to be asymptotically equivalent to a Perona-Malik equation [1], one of the first nonlinear PDEs used for image restoration.

The diffusion process on the depth image z is performed using the gray level image u to define the neighborhood. In order to make the diffusion process faster, the sup over the neighborhood is taken instead of the average, while the average is taken only in the last iterations. The depth diffusion filter (DDF) can be written in a continuous form as

    DDF_{h,ρ} z(x) = sup_{y ∈ B_ρ(x)} z(y) e^{−|u(x)−u(y)|²/h²}    (3)

This filter is applied iteratively until stability is attained. After each iteration, the values of FSPs and BSPs are modified in order to hold at least the initial depth gradient. This constraint corresponds to Neumann internal boundary conditions, understood as a prespecified jump c Dz/Dn as the boundary is crossed, where c is a positive constant and n is the normal to the boundary. This allows one to handle simple sorting when objects are located in multiple layers. In the case of occlusion and transparency there is also a depth order between the two regions separated respectively by the stem of the T and by the contour of the underlying object. Occlusion and transparency do not carry any information about the partial order between the underlying object and the background. This depth order can be inferred from other cues, such as convexity or visual completion. When information about this partial order is present, the depth gradient between one of the BSPs and the FSPs increases. This is the reason why we force source points to hold "at least" the initial depth gradient. To handle visual completion, after each iteration pairs of relatable regions (see Fig.6) are forced to maintain the same depth. In the case of modal completion, one of the two BSPs has a gray level similar to that of the FSP. For this reason we modify the way the neighborhood is defined. Let r and s be the lines the modal contours lie on. The neighborhood Nρ is defined as follows: Nρ = {y | y ∈ B_ρ(x), y ∈ α_r(x), y ∈ β_s(x)}, where α_r(x) is the half image plane including x with origin the line r and β_s(x) is the half image plane including x with origin the line s.

4. Experimental Results

We tested our model on a set of real images (taken by a digital camera) involving occlusion, transparency, convexity, visual completion (both amodal and modal) and self-occlusion. For each experiment we show four images: the original image; the image showing the segments found by applying LSD to the original image; the image where the initial depth gradient at depth cue points is represented through vectors pointing to the region closer to the viewpoint (red vectors arise from T-junctions; green vectors arise from local convexity, and each of them represents the point having the biggest curvature value of the connected components obtained by thresholding the curvature); and the depth image obtained by performing the proposed method. The depth map is rendered through gray level values (high values indicate regions that are close to the camera). In the example on the first row (Fig.9), local convexity leads one to see the disk over the table. The second row is an example involving convexity and occlusion: it shows that the proposed method is able to handle simple sorting in the presence of multiple depth layers. The third and the fourth rows are examples of amodal and modal completion respectively: in the former case, the detection of pairs of relatable T-junctions leads one to see the green piece of paper partially occluded by the white strips as a meaningful unit; in the latter case, the detection of a pair of degenerate T-junctions leads one to see the rectangle in front of the square. In the example on the fifth row, the transparency phenomenon is correctly interpreted. In the example on the sixth row, the occluding contours have different depth relationships at different points along their length. However, the proposed method performs well also in this ambiguous situation. The examples on the last two rows involve more realistic scenarios. While in the first example the solution is pretty contrived, in the second we show a case of failure. What has caused the failure in this case is that a region with homogeneous texture (the mountain behind the rock) has been marked as FSP because of the T-junction on the rock peak and as BSP because of the curvature vector. This example also demonstrates that the proposed DDF can handle homogeneous texture (see the mountains and the biggest rock) but fails when shading conditions cause strong intensity oscillations (see the rock in the bottom-right corner).

Figure 9. (a) Original (b) Segments detected by LSD (c) Local Depth Information (d) Depth image
5. Conclusions

In this work we have proposed a mechanism for monocular depth perception completely based on phenomenology. Experimental results involving occlusion, transparency, convexity, visual completion (both amodal and modal) and self-occlusion have shown a correct interpretation of several real images. In contrast with the prior state of the art, the cue detection is automatic and the depth synthesis is led by a very elementary mechanism, namely an iterated neighborhood filter. The experiments shown here on real images give a high confidence in the DDF as a way to diffuse depth information from local depth cues. In particular, contradictory information given by conflicting depth cues was dealt with correctly by the proposed mechanism, which permits two regions to invert their depths harmoniously, in full agreement with phenomenology, and very diverse gestalt laws were fused harmoniously within this simple and plausible mechanism. Although the experiments have been performed on real images, a new generation of detectors will be needed to deal with real world images, where T-junctions, convexity, etc. cannot always be computed from local information. Further research must therefore focus on more and more global cue detectors.

References

[1] A. Buades et al. Neighborhood filters and PDE's. Numerische Mathematik, 105(1):1–34, 2006.
[2] A. Buades et al. The staircasing effect in neighborhood filters and its solution. IEEE Tr. on IP, 15(6):1499–1505, 2006.
[3] J. Burns et al. Extracting straight lines. IEEE Tr. on PAMI, 8(4):425–455, 1986.
[4] A. Desolneux et al. Meaningful alignments. IJCV, 40(1):7–23, 2000.
[5] D. Mumford and J. Shah. Optimal approximations of piecewise smooth functions and associated variational problems. Communications in Pure and Applied Mathematics, 42:577–685, 1989.
[6] S. Esedoglu and R. March. Segmentation with depth but without detecting junctions. J. Math. Imaging and Vision, 18:7–15, 2003.
[7] R. Fleming and B. Anderson. In The Visual Neurosciences, pages 1284–1299. L. Chalupa and J.S. Werner, Eds. Cambridge, MA: MIT Press, 2004.
[8] R.-X. Gao et al. Bayesian inference for layer representation with mixed Markov random field. LNCS, 4679:213–224, 2007.
[9] D. Geiger and P. L. Visual organization for figure-ground separation. In CVPR, pages 155–160, 1996.
[10] R. Grompone von Gioi et al. LSD: A line segment detector. Submitted to IEEE Tr. on PAMI, 2007.
[11] F. Heitger and R. von der Heydt. A computational model of neural contour processing: Figure-ground segregation and illusory contours. In ICCV, pages 32–40, 1993.
[12] H. Helmholtz. Treatise on Physiological Optics. James P. C. Southall, 1925.
[13] D. Hoiem et al. Recovering occlusion boundaries from a single image. In ICCV, pages 1–8, 2007.
[14] G. Kanizsa. La Grammatica del Vedere. Diderot, 1996.
[15] P. Kellman and T. Shipley. Visual interpolation in object perception. Current Directions in Psychological Science, 1(6):193–199, 1991.
[16] N. Kogo et al. Reconstruction of subjective surfaces from occlusion cues. In Biologically Motivated Computer Vision: second workshop of BMVC, pages 311–312, 2002.
[17] S. Madarasmi et al. Illusory contour detection using MRF models. In World Congress on Computational Intelligence, pages 4343–4348, 1994.
[18] D. Marr. Vision. W.H. Freeman and Co., New York, 1982.
[19] F. Metelli. The perception of transparency. Scientific American, 230:354–366, 1974.
[20] W. Metzger. Gesetze des Sehens. Waldemar Kramer, 1975.
[21] P. Mordohai and G. Medioni. Junction inference and classification for figure completion using tensor voting. In CVPRW, volume 4, pages 56–64, 2004.
[22] M. Nitzberg and D. Mumford. The 2.1-D sketch. In ICCV, pages 138–144, 1990.
[23] A. Pentland. A new sense for depth of field. In ICCV, pages 839–846, 1985.
[24] M. Proesmans and L. Van Gool. Grouping based on coupled diffusion maps. LNCS, pages 196–216, 1999.
[25] X. Ren et al. Figure/ground assignment in natural images. In ECCV, pages 614–627, 2006.
[26] E. Saund. Perceptual organization of occluding contours of opaque surfaces. Computer Vision and Image Understanding, 76(1):70–82, 1999.
[27] A. Saxena et al. Learning 3-D scene structure from a single still image. In ICCV, pages 1–8, 2007.
[28] S. Smith and M. Brady. SUSAN - a new approach to low level image processing. IJCV, 23(1):45–78, 1997.
[29] X. Stella et al. A hierarchical Markov random field model for figure-ground segregation. In CVPR, pages 110–133, 2001.
[30] S. Thiruvenkadam et al. Segmentation under occlusion using selective shape prior. Scale Space and Variational Methods in Computer Vision, 4485:191–202, 2007.
[31] C. Tomasi and R. Manduchi. Bilateral filter for gray and color images. In ICCV, pages 988–994, 1998.
[32] M. Wertheimer. Untersuchungen zur Lehre der Gestalt, II. Psychologische Forschung, 4:301–350, 1923.
[33] L. Williams. Perceptual organization of occluding contours. In ICCV, pages 133–137, 1990.
[34] L. Yaroslavsky. Digital Picture Processing - An Introduction. New York: Springer-Verlag, 1985.