Abstract
Disparity estimation for stereo endoscopic images is difficult because such images lack the distinct textures needed for matching, and many well-known algorithms fail to estimate a reliable dense disparity map in this application. In this paper, we propose a strategy that uses a sparse set of feature points to estimate reliable disparity values, which are then propagated to the remaining non-feature points to form the final dense disparity map. The selected feature points include SIFT, Canny-edge, Canny-edge-dilated, and grid points. The disparity propagation algorithms are based on bilateral interpolation and refinement. Experiments show that our algorithm successfully recovers the shape of the surgical instruments in the disparity map, which is helpful for 3D display and human depth perception. Disparity estimation for endoscopic images remains an open issue, and our preliminary results still leave room for improvement.
1 Introduction
In recent years, Minimally Invasive Surgery (MIS) has become a very important technology in medical operations. Compared to traditional surgery, MIS reduces wound size and shortens the patient's recovery time. However, the physician must be well trained so that the operation is accurate and safe. Traditionally, the physician holds the instruments for treatment while looking at an LCD display in front of him/her that shows the video captured by the camera. Since the video is two-dimensional and lacks depth perception, the physician must be highly skilled so that the treatment is accurate and safe and does not require excessive time to reach the parts of an organ that need treatment. Experiments with a dual-eye stereo endoscope and a 3D display have shown that 3D perception can help surgeons reduce operation time by up to 50% relative to the traditional 2D endoscope (e.g., the famous da Vinci surgical system is equipped with a stereo endoscopic camera). However, the surgeons need to wear IR-synchronized glasses to perceive the 3D effect. An autostereoscopic display (i.e., a naked-eye 3D display) could release the surgeons from heavy glasses and make them more comfortable during surgical operations. This requires depth information which, together with the texture (i.e., color) information, can be used to synthesize multiple novel views for 3D display based on the well-known DIBR (Depth-Image-Based Rendering) technique.
Stereo matching (or disparity estimation) has been an important step in estimating the depths of 3D points from two views (left and right) for computer vision applications. It can be categorized into local/window-based and global methods. In general, the global algorithms [1,2,3,4,5,6, 12, 14] minimize an energy function, comprising a data term and a smoothness term, to obtain optimal results through an iterative procedure. Although good depth quality can be obtained, the disadvantages are high complexity, considerable computational cost, and difficulty of hardware implementation. On the other hand, the local/window-based algorithms [7,8,9, 15,16,17,18,19] are easier to implement, but less accurate than the global kind.
The simplest local approach to disparity estimation is block matching, similar to motion estimation in video coding standards. However, its estimation result is too noisy to be used in practical applications. In block matching, the emphasis is on the matching-cost computation and cost-aggregation steps. The final disparity for each pixel is the candidate with the minimum aggregated cost. To reduce ambiguity, the matching costs are aggregated over a support window (often a rectangle). A larger window size (e.g., over 35 × 35) often performs better, but at a much higher computational cost. A fixed window does not deal explicitly with uniform areas and repetitive patterns and ignores depth discontinuities. Hence, multiple windows are adopted in [16] to generate several depth candidates for each image pixel, and a robust strategy is used to extract high-quality estimates from the candidates. In [15], a window-based method using varying support-weights is proposed to obtain accurate results at depth discontinuities as well as in homogeneous regions; the support-weights of the pixels in a given support window are computed from color similarity and geometric proximity. However, the pixel-wise adaptive support-weights make the computation somewhat more expensive than other area-based local methods. In [17], a color segmentation technique is used to assist in finding a better matching window, which benefits disparity estimation at depth discontinuities but increases computational complexity. In [18], a multi-directional varying-scale polygon is used to approximate the spatially varying image structure around each pixel, called locally adaptive polygon approximation (LAPA). Since LAPA still cannot accurately approximate detailed image structure, the same authors proposed a cross-based support-window approach [19], which outperforms LAPA in approximating uniform color areas (whose pixels are supposed to have similar disparities) and offers hardware efficiency through constant-time computation via integral images.
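As an illustration of the baseline local approach described above, the following sketch (an illustrative assumption, not the authors' implementation) computes the disparity of one pixel by aggregating absolute-difference costs over a square support window and picking the candidate with minimum aggregated cost:

```python
import numpy as np

def block_match_pixel(left, right, x, y, max_disp=64, win=5):
    """Estimate the disparity at (x, y) of the left image by SAD block matching.
    left, right: rectified grayscale images as 2D float arrays (horizontal epipolar lines).
    win: half-size of the square support window used for cost aggregation.
    """
    h, w = left.shape
    # Clip the support window to the image borders.
    y0, y1 = max(0, y - win), min(h, y + win + 1)
    x0, x1 = max(0, x - win), min(w, x + win + 1)
    patch = left[y0:y1, x0:x1]

    best_d, best_cost = 0, np.inf
    for d in range(0, max_disp + 1):
        if x0 - d < 0:                          # candidate window would leave the right image
            break
        cand = right[y0:y1, x0 - d:x1 - d]
        cost = np.abs(patch - cand).sum()       # aggregated SAD cost over the support window
        if cost < best_cost:
            best_cost, best_d = cost, d
    return best_d
```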
Endoscopic images in MIS are often captured within a distance of about 20 cm under a point light source (e.g., in laparoscopy). Compared to natural indoor/outdoor images, they are characterized by: (1) possibly uneven lighting, (2) sparse and unclear texture on organ surfaces, (3) mirror-like reflections from metal instruments, and (4) possible blur due to fast instantaneous motion of the camera or instruments. Figure 1 shows two examples of endoscopic images captured from a phantom and a living pig, respectively.
Due to the lack of significant texture in endoscopic images, stereo matching for estimating the disparity map faces a challenge similar to the smoothness problem of traditional algorithms: when estimating a dense disparity (depth) map, many of the estimated disparities have low confidence. To overcome this problem, we propose a strategy in which traditional stereo matching is applied only to selected pixels that possess strong texture and are likely to yield high-confidence disparity estimates. Disparities for the remaining pixels are then obtained by spatially propagating the known disparities. Since the selected feature pixels are often sparse, the computation time for the dense disparity map can also be reduced significantly.
The flowchart of our proposed disparity estimation algorithm based on sparse feature points is illustrated in Fig. 2. First, we define the feature points to be extracted for stereo matching. Disparities for the different kinds of feature points are then estimated with different algorithms. Finally, the estimated sparse disparities are spatially propagated to the remaining pixels to obtain a dense disparity (depth) map for 3D applications (e.g., DIBR novel-view synthesis).
2 Selection and Extraction of Feature Points for Disparity Estimation
The sparse feature points come from four sources: (1) SIFT [10] feature points, (2) Canny-edge points, (3) Canny-edge-dilated points, and (4) grid points, as shown in Table 1, together with the corresponding method used for disparity estimation.
SIFT (Scale-Invariant Feature Transform) is well known to provide good feature points for image registration. High-strength points are detected in both the left and right views, and their 128-dimensional SIFT descriptors are computed for matching. Since the stereo views are pre-calibrated to have horizontal epipolar lines, wrong matches can be easily removed. The disparity for a pair of SIFT feature points (one from the left view and the other from the right view) is obtained from their horizontal displacement.
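A possible realization of this step with OpenCV is sketched below; the detector settings, ratio-test threshold, and epipolar tolerance are our assumptions and are not specified in the paper:

```python
import cv2
import numpy as np

def sift_disparities(left_gray, right_gray, max_y_diff=1.0):
    """Match SIFT keypoints between rectified views and return (x, y, disparity) triples.
    Matches whose vertical displacement exceeds max_y_diff violate the horizontal
    epipolar constraint and are discarded as wrong matches.
    """
    sift = cv2.SIFT_create()
    kp_l, des_l = sift.detectAndCompute(left_gray, None)
    kp_r, des_r = sift.detectAndCompute(right_gray, None)

    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = matcher.knnMatch(des_l, des_r, k=2)

    result = []
    for m, n in matches:
        if m.distance > 0.7 * n.distance:       # Lowe's ratio test to reject ambiguous matches
            continue
        xl, yl = kp_l[m.queryIdx].pt
        xr, yr = kp_r[m.trainIdx].pt
        if abs(yl - yr) > max_y_diff:           # horizontal epipolar-line check
            continue
        result.append((xl, yl, xl - xr))        # disparity from the horizontal displacement
    return result
```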
Since the number of SIFT feature points is often limited (only a few hundred), Canny-edge points constitute our main source of candidates for disparity estimation, as shown in Fig. 3.
In principle, Canny-edge points lie on the boundaries between two depth planes, whereas seeds are needed on both planes (i.e., on both sides of the boundaries) so that the respective depths (disparities) can be propagated throughout the rest of each region. To obtain such seeds, the Canny-edge points are first expanded with two windows of different sizes; the difference of the two expanded maps yields the Canny-edge-dilated points, as shown in Fig. 4.
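One way to obtain the Canny-edge and Canny-edge-dilated point sets is sketched below; the Canny thresholds and the two dilation window sizes are illustrative assumptions:

```python
import cv2
import numpy as np

def canny_point_sets(gray, small_ksize=3, large_ksize=7):
    """Return the Canny-edge map and the Canny-edge-dilated map.
    The dilated set is the difference of two dilations of different window sizes,
    so that it contains seed pixels lying on both sides of each depth boundary.
    """
    edges = cv2.Canny(gray, 50, 150)

    small = cv2.dilate(edges, np.ones((small_ksize, small_ksize), np.uint8))
    large = cv2.dilate(edges, np.ones((large_ksize, large_ksize), np.uint8))

    # Pixels covered by the large dilation but not the small one form a band on both
    # sides of the edges, slightly away from the depth boundary itself.
    edge_dilated = cv2.subtract(large, small)
    return edges, edge_dilated
```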
Since smooth areas may contain no Canny-edge points at all, we generate grid points (as shown in Fig. 5) so that the feature points for disparity estimation are distributed more uniformly over the image. To improve performance, each grid point is replaced by the pixel with the largest gradient within a small window (e.g., 3 × 3) around it.
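The grid-point selection with gradient-based refinement could be implemented roughly as follows; the grid spacing follows the value reported in Sect. 4, while the choice of the Sobel gradient operator is an assumption:

```python
import cv2
import numpy as np

def grid_points(gray, step=8, win=1):
    """Place points on a regular grid, then move each point to the pixel with the
    largest gradient magnitude inside its (2*win+1) x (2*win+1) neighborhood."""
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    grad = np.sqrt(gx * gx + gy * gy)

    h, w = gray.shape
    points = []
    for y in range(step // 2, h, step):
        for x in range(step // 2, w, step):
            y0, y1 = max(0, y - win), min(h, y + win + 1)
            x0, x1 = max(0, x - win), min(w, x + win + 1)
            local = grad[y0:y1, x0:x1]
            dy, dx = np.unravel_index(np.argmax(local), local.shape)
            points.append((x0 + dx, y0 + dy))   # replace grid point by strongest-gradient pixel
    return points
```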
Table 1 shows that, except for the SIFT feature points, disparity estimation for the other three kinds of feature points resorts to block matching. The window size used in block matching depends on the type of feature point (see the experiments).
To enhance the accuracy of the sparse disparities, the random sample consensus (RANSAC) algorithm is applied to each kind of feature point to eliminate false matching pairs based on homographic transformation error. That is, only the sifted feature points are retained as seeds for spatial propagation.
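A hedged sketch of this sifting step is given below, assuming a single global homography per feature kind estimated with OpenCV's RANSAC routine; the reprojection threshold is illustrative:

```python
import cv2
import numpy as np

def sift_by_ransac(left_pts, right_pts, reproj_thresh=3.0):
    """Keep only matched pairs consistent with a homography estimated by RANSAC.
    left_pts, right_pts: (N, 2) arrays of corresponding points.
    Returns the inlier subsets, which serve as seeds for spatial propagation.
    """
    left_pts = np.asarray(left_pts, dtype=np.float32)
    right_pts = np.asarray(right_pts, dtype=np.float32)
    H, mask = cv2.findHomography(left_pts, right_pts, cv2.RANSAC, reproj_thresh)
    if mask is None:                      # no consistent model found
        return left_pts[:0], right_pts[:0]
    inliers = mask.ravel().astype(bool)   # RANSAC inlier flags
    return left_pts[inliers], right_pts[inliers]
```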
3 Spatial Propagation for Global Disparity Map
3.1 Iterative Interpolation by Bilateral Filtering
Since the estimated disparity points are sparse, pixels of unknown disparity are first iteratively interpolated from the known ones; they are then refined according to a global optimization criterion. Bilateral filtering is used for interpolation in our system:
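The original equation block (Eqs. (1)–(4)) is not reproduced in this version of the text; the following is a plausible reconstruction consistent with the symbol definitions below, in which the Gaussian forms of the two weights are our assumption:

$$D(x,y)=\frac{\sum_{(x_i,y_i)\in N(x,y)} w(x_i,y_i)\,D(x_i,y_i)}{\sum_{(x_i,y_i)\in N(x,y)} w(x_i,y_i)} \quad (1)$$

$$w(x_i,y_i)=w_c(x_i,y_i)\,w_d(x_i,y_i) \quad (2)$$

$$w_c(x_i,y_i)=\exp\!\left(-\frac{\left\| I(x,y)-I(x_i,y_i)\right\|^2}{2\sigma_c^2}\right) \quad (3)$$

$$w_d(x_i,y_i)=\exp\!\left(-\frac{(x-x_i)^2+(y-y_i)^2}{2\sigma_d^2}\right) \quad (4)$$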
where \( (x,y) \) is a pixel of unknown disparity, \( D(x,y) \) is its estimated disparity, \( N(x,y) \) is the neighborhood of \( (x,y) \), \( w_{c}(x_{i},y_{i}) \) is the color weighting, \( w_{d}(x_{i},y_{i}) \) is the distance weighting, \( I(x,y) \) is the intensity at \( (x,y) \), and \( \sigma_{c} \) and \( \sigma_{d} \) are parameters that determine the sharpness of the weighting functions. Equations (1)–(4) are executed iteratively, starting from the neighborhoods of the feature points, until all pixels are interpolated. The result is shown in Fig. 6.
3.2 Global Refinement of Disparities
The interpolated disparities are adopted as the initial disparity map for refinement. Following [11], refinement is based on Eqs. (5) and (6):
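As above, the equation block is not reproduced here; a reconstruction consistent with the description below is given, where Eq. (5) follows directly from its stated similarity to Eq. (1) and the confidence-update rule in Eq. (6) is our assumption:

$$D^{t+1}(x,y)=\frac{\sum_{(x_i,y_i)\in N(x,y)} w(x_i,y_i)\,c^{t}(x_i,y_i)\,D^{t}(x_i,y_i)}{\sum_{(x_i,y_i)\in N(x,y)} w(x_i,y_i)\,c^{t}(x_i,y_i)} \quad (5)$$

$$c^{t+1}(x,y)=\frac{\sum_{(x_i,y_i)\in N(x,y)} w(x_i,y_i)\,c^{t}(x_i,y_i)}{\sum_{(x_i,y_i)\in N(x,y)} w(x_i,y_i)} \quad (6)$$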
where t is the iteration number, \( w(x_{i},y_{i}) \) is the same weight as in Eq. (2), and \( c(x_{i},y_{i}) \) is the confidence value for the disparity propagated from the neighboring point \( (x_{i},y_{i}) \). Note that Eq. (5) is similar to Eq. (1), except for the extra per-pixel confidence term, whose initial value is 1 for the feature points and 1/n for the interpolated points (where n is the iteration at which the disparity was interpolated). The confidence terms are updated according to the texture information. Normally, \( N(x,y) \) is chosen as a 3 × 3 window (Fig. 7).
4 Experimental Results
The stereo endoscopic images are captured with a KARL STORZ 3D system, as shown in Fig. 8. Three sets of images are captured from a phantom to simulate the MIS environment, and the fourth is captured from a living pig. The experiments are run on an Intel(R) Core(TM) i7-4790 3.60 GHz CPU with 16 GB memory, using the Visual Studio C/C++ 2010 development platform.
The experiments use the following parameters: (1) minimum/maximum disparity = 0/300, (2) grid sampling = 8 × 8, (3) matching window size for Canny-edge and Canny-edge-dilated points = 11 × 11, and (4) matching window size for grid points = 31 × 31. The different window sizes are chosen for the following reasons: (1) Canny-edge and Canny-edge-dilated points may be located near depth boundaries, where occlusion often degrades the accuracy of block matching, so the window size is reduced; (2) grid points often have little texture, so the window size is increased.
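For reference, the parameter set above can be collected into a small configuration structure; the values follow those reported here, while the variable names are purely illustrative:

```python
# Parameters used in the experiments (Sect. 4); the names are illustrative.
MATCH_PARAMS = {
    "min_disparity": 0,
    "max_disparity": 300,
    "grid_step": 8,              # 8 x 8 grid sampling
    "edge_match_window": 11,     # 11 x 11 window for Canny-edge / Canny-edge-dilated points
    "grid_match_window": 31,     # 31 x 31 window for grid points (little texture)
}
```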
Figure 9 shows the matching results between the left and right views (for the first set of images) for the four kinds of feature points. The feature points are drawn in different colors for easier discrimination. The numbers of matched pairs are 231, 18181, 11924, and 4996, respectively.
We compare several methods in Fig. 10: (b) traditional block matching (BM), (c) the cross-based method [19], (d) semi-global matching [13], and (e) the proposed method. In Fig. 10(b) and (d), red pixels indicate disparity estimates that are considered unreliable. As expected, BM leads to noisy results. It is, however, surprising that the well-known rSGM method [13] fails to estimate reliable disparities for endoscopic images: over 80% of the area is considered unreliable and colored red. The popular cross-based matching method [19] also fails to estimate accurate boundaries for the instrument. Our proposed algorithm, although it still has room for improvement, is capable of recovering the shape of the instrument.
Table 2 lists the numbers of original and sifted feature points and their corresponding proportions for each test image set (the original image size is 1620 × 540 pixels). The average proportions before and after sifting are 7.065% and 3.15%, respectively, which are small enough to help reduce the overall computing time.
5 Conclusion
Due to the lack of distinct texture, it is difficult to estimate a correct and reliable disparity map for stereo endoscopic images. This paper proposes a strategy of estimating disparities for a sparse set of feature points, whose values are then spatially propagated to the remaining pixels to form a dense disparity map. The sparse set of feature points includes SIFT, Canny-edge, Canny-edge-dilated, and grid points. Their estimated disparities are verified (sifted) to constitute the seeds for spatial propagation. Bilateral filtering is adopted for the initial disparity interpolation, and the results are then refined with information propagated from spatial neighbors.
Experiments show that many well-known algorithms fail on our endoscopic image sets, while our algorithm successfully recovers the shape of the instruments. Disparity map estimation for endoscopic images remains an open research issue, and our preliminary results still leave room for improvement.
Future work will focus on selecting prominent feature points that are more suitable for reliable disparity estimation and propagation.
References
Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern Anal. Mach. Intell. 23(11), 1222–1239 (2001)
Kolmogorov, V., Zabih, R.: Computing visual correspondence with occlusions via graph cuts. In: Proceedings of International Conference on Computer Vision, vol. II, pp. 508–515 (2001)
Hong, L., Chen, G.: Segment-based stereo matching using graph cuts. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, vol. I, pp. 74–81 (2004)
Sun, J., Zheng, N.-N., Shum, H.-Y.: Stereo matching using belief propagation. IEEE Trans. Pattern Anal. Mach. Intell. 25(7), 787–800 (2003)
Yang, Q., Wang, L., Yang, R., Stewenius, H., Nister, D.: Stereo matching with color-weighted correlation, hierarchical belief propagation and occlusion handling. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2006)
Klaus, A., Sormann, M., Karner, K.: Segment-based stereo matching using belief propagation and a self-adapting dissimilarity measure. In: Proceedings of International Conference on Pattern Recognition (2006)
Veksler, O.: Fast variable window for stereo correspondence using integral images. In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 556–561 (2003)
Yuan, D., Cheng, F., Zhang, H.: Dense stereo matching based on edge constraint and variable windows. In: Proceedings of IEEE International Conference on Robotics and Biomimetics, pp. 1912–1917 (2011)
Qiu, M., Zhang, Y.: Feature guided multi-window area-based matching method for urban remote sensing stereo pairs. In: Proceedings of IEEE International Conference on Geoscience and Remote Sensing Symposium (IGARSS), pp. 1912–1917 (2011)
Lowe, D.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004)
Yan, X., Yang, Y., Er, G., Dai, Q.: Depth map generation for 2D-to-3D conversion by limited user inputs and depth propagation. In: Proceedings of IEEE International Conference on the True Vision-Capture, Transmission and Display of 3D Video (3DTV-CON), pp. 1–4 (2011)
Rother, C., Blake, A., Kolmogorov, V.: Grabcut – interactive foreground extraction using iterated graph cuts. ACM Trans. Graph. 23(3), 309–314 (2004)
Spangenberg, R., Langner, T., Adfeldt, S., Rojas, R.: Large scale semi-global matching on the CPU. In: Proceedings of the IEEE Intelligent Vehicles Symposium, pp. 195–201 (2014)
Yang, Q., Wang, L., Yang, R., Stewenius, H., Nister, D.: Stereo matching with color-weighted correlation, hierarchical belief propagation, and occlusion handling. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 31(3), 492–504 (2009)
Yoon, K., Kweon, I.: Adaptive support-weight approach for correspondence search. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 28, 650–656 (2006)
Boughorbel, F.: A new multiple-windows depth from stereo algorithm for 3D displays. In: Proceedings of IEEE International Conference on Capture, Transmission and Display of 3D Video (3DTV-CON), Kos Island, pp. 1–4 (2007)
Gerrits, M., Bekaert, P.: Local stereo matching with segmentation-based outlier rejection. In: Proceedings of IEEE International Conference on Computer and Robot Vision (CRV), New York, USA, p. 66 (2006)
Zhang, K., Lu, J., Lafruit, G.: Scalable stereo matching with locally adaptive polygon approximation. In: Proceedings of IEEE International Conference on Image Processing (ICIP), pp. 313–316 (2008)
Zhang, K., Lu, J., Lafruit, G.: Cross-based local stereo matching using orthogonal integral images. IEEE Trans. Circuits Syst. Video Tech. 19(7), 1073–1079 (2009)