
1 Introduction

In recent years, Minimally Invasive Surgery (MIS) has become an important technology in medical operations. Compared to traditional open surgery, MIS reduces wound size and shortens the patient's recovery time. However, physicians need to be well trained so that operations are accurate and safe. Traditionally, the physician holds the instruments for treatment while looking at an LCD display in front of him/her that shows the video captured by the camera. Since the video is two-dimensional and lacks depth perception, physicians must be highly skilled to operate accurately and safely, without spending excessive time reaching the parts of an organ that need to be operated on. Experiments with dual-eye stereo endoscopes and 3D displays have shown that 3D perception can help surgeons reduce the operation time by up to 50% relative to the traditional 2D endoscope (e.g., the famous da Vinci surgical system is equipped with a stereo endoscopic camera). However, the surgeons need to wear IR-synchronized glasses to perceive the 3D effect. An autostereoscopic display (i.e., a naked-eye 3D display) could free surgeons from heavy glasses and make them more comfortable during surgical operations. This requires depth information which, together with the texture (i.e., color) information, can be used to synthesize multiple novel views for 3D display based on the well-known DIBR (Depth-Image-Based Rendering) technique.

Stereo matching (or disparity estimation) is an important step in estimating the depths of 3D points from two views (left and right) for computer vision applications. Existing methods can be categorized into local/window-based and global methods. In general, the global algorithms [1,2,3,4,5,6, 12, 14] minimize an energy function, consisting of a data term and a smoothness term, to obtain optimal results through an iterative procedure. Though good depth quality can be obtained, the disadvantages are high complexity, considerable computational cost, and difficulty of hardware implementation. On the other hand, the local/window-based algorithms [7,8,9, 15,16,17,18,19] are easier to implement, but less accurate than the global ones.

The simplest local approach to disparity estimation is a block matching procedure similar to motion estimation in video coding standards. However, the estimation result is too noisy for practical applications. During block matching, the emphasis is on the matching cost computation and the cost aggregation steps. The final disparity value for each pixel is the one with the minimum aggregated cost. To reduce ambiguity, the matching costs are aggregated over a support window (often a rectangle). A larger window (e.g., over 35 × 35) often performs better, but at a much higher computational cost. A fixed window does not deal explicitly with uniform areas and repetitive patterns, and it ignores depth discontinuities. Hence, multiple windows are adopted in [16] to generate several depth candidates for each image pixel, and a robust strategy is used to extract high-quality estimates from these candidates. In [15], a window-based method using varying support-weights is proposed to obtain accurate results at depth discontinuities as well as in homogeneous regions. The support-weights of the pixels in a given support window are computed based on color similarity and geometric proximity. However, the pixel-wise adaptive support-weights make the computation somewhat more expensive than other area-based local methods. In [17], a color segmentation technique is used to assist in finding a better matching window. This benefits disparity estimation in depth-discontinuity regions, but increases the computational complexity. In [18], a multi-directional varying-scale polygon is used to approximate the spatially varying image structure around each pixel, called locally adaptive polygon approximation (LAPA). However, LAPA still cannot accurately approximate detailed image structure.
Hence, the same authors proposed a cross-based support window approach [19], which outperforms LAPA in approximating uniform color areas (whose pixels are supposed to have similar disparities) and is hardware-efficient thanks to constant-time computation via integral images.
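To make the baseline concrete, fixed-window block matching with constant-time cost aggregation via integral images (the device exploited in [19]) can be sketched as follows. This is an illustrative NumPy sketch under our own naming and a SAD cost, not the implementation of any cited work:

```python
import numpy as np

def box_filter(img, r):
    """Window sums over (2r+1)x(2r+1) via an integral image: constant time per pixel."""
    ii = np.pad(img, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    h, w = img.shape
    y0 = np.clip(np.arange(h) - r, 0, h); y1 = np.clip(np.arange(h) + r + 1, 0, h)
    x0 = np.clip(np.arange(w) - r, 0, w); x1 = np.clip(np.arange(w) + r + 1, 0, w)
    return ii[np.ix_(y1, x1)] - ii[np.ix_(y0, x1)] - ii[np.ix_(y1, x0)] + ii[np.ix_(y0, x0)]

def block_match_disparity(left, right, max_disp, half_win):
    """Winner-take-all SAD block matching on a rectified pair: each pixel takes
    the disparity whose aggregated cost over the support window is minimal."""
    h, w = left.shape
    disp = np.zeros((h, w), dtype=np.int32)
    best_cost = np.full((h, w), np.inf)
    for d in range(max_disp + 1):
        shifted = np.zeros_like(right)
        shifted[:, d:] = right[:, :w - d]                   # right view shifted by candidate d
        agg = box_filter(np.abs(left - shifted), half_win)  # aggregated SAD cost
        mask = agg < best_cost
        disp[mask] = d
        best_cost[mask] = agg[mask]
    return disp
```

Note that the integral-image trick makes the per-disparity aggregation cost independent of the window size, which is why large support windows remain tractable.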

Endoscopic images in MIS are often captured within a distance of about 20 cm under a point light source (e.g., in laparoscopy). Compared to natural indoor/outdoor images, they are characterized by: (1) possibly uneven lighting, (2) sparse and unclear texture on organ surfaces, (3) mirror-like reflections from metal instruments, and (4) possible blur due to fast instantaneous motion of the camera or instrument. Figure 1 shows two examples of endoscopic images captured from a phantom and a living pig, respectively.

Fig. 1. Endoscopic images for (a) a phantom, (b) a pig living body.

Due to the lack of significant texture in endoscopic images, stereo matching for disparity map estimation faces a challenge similar to the smoothness problem of traditional algorithms: when estimating a dense disparity (depth) map, a large number of the estimated disparities have low confidence. To overcome this problem, we propose a strategy in which traditional stereo matching is applied only to selected pixels that possess rich texture and are likely to yield high-confidence disparity estimates. Disparities for the remaining pixels are then obtained by spatially propagating the known disparities. Since the selected feature pixels are often sparse, the computation time for the dense disparity map can also be reduced significantly.

The flowchart of our proposed disparity estimation algorithm based on sparse feature points is illustrated in Fig. 2. First, we define the feature points to be extracted for stereo matching. Disparities for the different kinds of feature points are then estimated with different algorithms. Finally, the estimated sparse disparities are spatially propagated to the remaining pixels to obtain a dense disparity (depth) map for 3D applications (e.g., DIBR novel-view synthesis).

Fig. 2. Flowchart of the proposed method.

2 Selection and Extraction of Feature Points for Disparity Estimation

The sparse feature points are drawn from four sources: (1) SIFT [10] feature points, (2) Canny edge points, (3) Canny-edge-dilated points, and (4) grid points, as shown in Table 1, where the corresponding disparity estimation method for each kind is also listed.

Table 1. Feature points selected for disparity estimation in our system.

SIFT (Scale-Invariant Feature Transform) points are well known to serve as good feature points in image registration applications. Points of high strength are detected in both the left and right views, and their 128-dimensional SIFT descriptors are computed for matching. Since the stereo views are typically pre-calibrated to have horizontal epipolar lines, wrong matches can be easily removed. The disparity for a pair of matched SIFT feature points (one from the left view and the other from the right view) is obtained from their horizontal displacement.
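A minimal sketch of such matching is given below, assuming pre-extracted keypoints and 128-D descriptors as NumPy arrays (the function name, the y-tolerance, and the use of Lowe's ratio test are our own illustrative choices, not details from the paper):

```python
import numpy as np

def match_sift(kp_l, desc_l, kp_r, desc_r, y_tol=1.0, ratio=0.8):
    """Match 128-D SIFT descriptors between rectified views.
    kp_*: (N, 2) arrays of (x, y); desc_*: (N, 128) descriptor arrays.
    Rectification makes epipolar lines horizontal, so right-view candidates
    whose y-coordinate differs by more than y_tol are rejected outright."""
    matches = []
    for i, (p, d) in enumerate(zip(kp_l, desc_l)):
        # epipolar gate: (nearly) the same scanline, non-negative disparity
        cand = np.where((np.abs(kp_r[:, 1] - p[1]) <= y_tol)
                        & (kp_r[:, 0] <= p[0]))[0]
        if len(cand) == 0:
            continue
        dist = np.linalg.norm(desc_r[cand] - d, axis=1)
        order = np.argsort(dist)
        # Lowe's ratio test: the best match must clearly beat the second best
        if len(order) > 1 and dist[order[0]] > ratio * dist[order[1]]:
            continue
        j = int(cand[order[0]])
        matches.append((i, j, float(p[0] - kp_r[j, 0])))  # disparity = x_l - x_r
    return matches
```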

Since the number of SIFT feature points is often limited (typically only a few hundred), Canny edge points constitute our main source of candidates for disparity estimation, as shown in Fig. 3.

Fig. 3. Result of Canny edge detection.

In principle, Canny edge points are located at boundaries between two depth planes. We need seeds on both planes (i.e., on both sides of the boundaries) so that the respective depths (disparities) can be propagated throughout the other portions. To obtain such seeds, the Canny edge points are first expanded via two windows of different sizes. The difference of these two expanded maps yields the Canny-edge-dilated points, as shown in Fig. 4.

Fig. 4. Canny-edge-dilated points: (a) Canny edges expanded by a 21 × 21 window, (b) Canny edges expanded by a 19 × 19 window, (c) dilated result obtained by differencing (a) and (b).
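The dilation-differencing step can be sketched as follows (an illustrative NumPy sketch with our own function names; radii 10 and 9 correspond to the 21 × 21 and 19 × 19 windows of Fig. 4):

```python
import numpy as np

def dilate_square(mask, r):
    """Binary dilation with a (2r+1)x(2r+1) square structuring element:
    a pixel is set if any edge pixel lies within Chebyshev distance r."""
    out = np.zeros_like(mask)
    h, w = mask.shape
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            ys, yd = max(0, dy), max(0, -dy)   # source/destination row offsets
            xs, xd = max(0, dx), max(0, -dx)   # source/destination column offsets
            out[yd:h - ys, xd:w - xs] |= mask[ys:h - yd, xs:w - xd]
    return out

def canny_edge_dilated(edges, r_outer=10, r_inner=9):
    """Band of seed points straddling the edges: the difference between a
    21x21 (r=10) and a 19x19 (r=9) dilation, as in Fig. 4(c)."""
    return dilate_square(edges, r_outer) & ~dilate_square(edges, r_inner)
```

The resulting band lies one pixel away from the inner dilation on every side of each edge, so seeds fall on both depth planes adjoining a boundary.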

Since Canny edge points may be absent in smooth areas, we generate grid points (as shown in Fig. 5) so that the feature points for disparity estimation are distributed more uniformly over the image. To improve performance, each grid point is replaced by the pixel with the largest gradient within a small window (e.g., 3 × 3) around it.

Fig. 5. Grid points.
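Grid-point generation with gradient-based snapping can be sketched as follows (illustrative NumPy code under our own naming; the sampling step and window size follow the text):

```python
import numpy as np

def grid_points(img, step=8, r=1):
    """Place one point every `step` pixels, then snap each point to the pixel
    with the largest gradient magnitude in its (2r+1)x(2r+1) neighborhood."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)                 # gradient magnitude
    h, w = img.shape
    pts = []
    for y in range(step // 2, h, step):
        for x in range(step // 2, w, step):
            y0, y1 = max(0, y - r), min(h, y + r + 1)
            x0, x1 = max(0, x - r), min(w, x + r + 1)
            win = mag[y0:y1, x0:x1]
            dy, dx = np.unravel_index(np.argmax(win), win.shape)
            pts.append((int(y0 + dy), int(x0 + dx)))   # snapped feature point
    return pts
```

Snapping nudges each grid point toward whatever local texture exists, which improves the reliability of the subsequent block matching in otherwise smooth regions.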

From Table 1, it can be seen that, except for the SIFT feature points, disparity estimation for the other three kinds resorts to block matching. The window size in block matching depends on the type of the feature points (see the experiments).

To enhance the accuracy of the sparse disparities, the random sample consensus (RANSAC) algorithm is applied to each kind of feature point to eliminate false matching pairs based on the homographic transform error. That is, only the feature points that survive this sifting are retained as seeds for spatial propagation.
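A simplified sketch of this sifting is given below: a homography is repeatedly fitted to four random pairs by the direct linear transform, and only the largest consensus set survives. This illustrates the principle only; the threshold, iteration count, and function names are our own assumptions:

```python
import numpy as np

def fit_homography(src, dst):
    """Direct linear transform: solve for H (up to scale) from >= 4 correspondences."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.asarray(rows, float))
    return vt[-1].reshape(3, 3)            # null-space vector reshaped to 3x3

def ransac_sift_matches(src, dst, n_iter=500, thresh=2.0, rng=None):
    """Keep only matches consistent with one homography: repeatedly fit H to
    4 random pairs and retain the largest consensus set (transfer error < thresh)."""
    rng = np.random.default_rng(rng)
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    n = len(src)
    best = np.zeros(n, bool)
    src_h = np.hstack([src, np.ones((n, 1))])        # homogeneous coordinates
    for _ in range(n_iter):
        idx = rng.choice(n, 4, replace=False)
        H = fit_homography(src[idx], dst[idx])
        proj = src_h @ H.T
        with np.errstate(divide="ignore", invalid="ignore"):
            proj = proj[:, :2] / proj[:, 2:3]        # perspective division
        err = np.linalg.norm(proj - dst, axis=1)     # transfer error
        inliers = err < thresh
        if inliers.sum() > best.sum():
            best = inliers
    return best                                      # boolean mask of sifted matches
```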

3 Spatial Propagation for Global Disparity Map

3.1 Iterative Interpolation by Bilateral Filtering

Since our disparity points are sparse, pixels of unknown disparity are first iteratively interpolated from the known ones. They are then refined based on a global optimization criterion. Bilateral filtering is used for interpolation in our system:

$$ D(x,y) = \frac{\sum_{(x_i,y_i) \in N(x,y)} w(x_i,y_i)\,D(x_i,y_i)}{\sum_{(x_i,y_i) \in N(x,y)} w(x_i,y_i)} $$
(1)
$$ w(x_i,y_i) = w_c(x_i,y_i)\,w_d(x_i,y_i) $$
(2)
$$ w_c(x_i,y_i) = \exp\left(-\frac{\left(I(x,y) - I(x_i,y_i)\right)^2}{2\sigma_c^2}\right) $$
(3)
$$ w_d(x_i,y_i) = \exp\left(-\frac{(x_i-x)^2 + (y_i-y)^2}{2\sigma_d^2}\right) $$
(4)

where \( (x,y) \) is the pixel of unknown disparity, \( D(x,y) \) is its estimated disparity, \( N(x,y) \) is the neighborhood of \( (x,y) \), \( w_c(x_i,y_i) \) is the color weight, \( w_d(x_i,y_i) \) is the distance weight, \( I(x,y) \) is the intensity at \( (x,y) \), and \( \sigma_c \) and \( \sigma_d \) are parameters that control the sharpness of the weighting functions. Equations (1)–(4) are applied iteratively, starting from the neighborhoods of the feature points, until all pixels are interpolated. The result is shown in Fig. 6.
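The interpolation of Eqs. (1)–(4) can be sketched as below (an illustrative NumPy sketch; the window radius, \( \sigma \) values, and function names are our own assumptions):

```python
import numpy as np

def bilateral_fill(intensity, disp, known, sigma_c=10.0, sigma_d=2.0, r=2):
    """Iterative interpolation (Eqs. 1-4): every unknown pixel that has known
    neighbors in its (2r+1)x(2r+1) window receives the bilaterally weighted
    average of those known disparities; passes repeat until the map is dense."""
    intensity = intensity.astype(float)
    disp = disp.astype(float).copy()
    known = known.copy()
    h, w = disp.shape
    yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
    w_dist = np.exp(-(yy ** 2 + xx ** 2) / (2 * sigma_d ** 2))   # Eq. (4)
    while not known.all():
        new_known, new_disp = known.copy(), disp.copy()
        for y in range(h):
            for x in range(w):
                if known[y, x]:
                    continue
                y0, y1 = max(0, y - r), min(h, y + r + 1)
                x0, x1 = max(0, x - r), min(w, x + r + 1)
                k = known[y0:y1, x0:x1]
                if not k.any():
                    continue                 # no seeds nearby yet; fill on a later pass
                w_col = np.exp(-(intensity[y0:y1, x0:x1] - intensity[y, x]) ** 2
                               / (2 * sigma_c ** 2))             # Eq. (3)
                wgt = w_col * w_dist[y0 - y + r:y1 - y + r, x0 - x + r:x1 - x + r] * k
                new_disp[y, x] = (wgt * disp[y0:y1, x0:x1]).sum() / wgt.sum()  # Eq. (1)
                new_known[y, x] = True
        known, disp = new_known, new_disp
    return disp
```

Each pass grows the known region outward from the seeds by up to \( r \) pixels, so the number of passes is bounded by the distance from the farthest pixel to its nearest seed.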

Fig. 6. Disparity interpolation by bilateral filtering.

3.2 Global Refinement of Disparities

The interpolated disparities are adopted as the initial disparity map for refinement. Following [11], the refinement is based on Eqs. (5) and (6):

$$ D^{t}(x,y) = \frac{\sum_{(x_i,y_i) \in N(x,y)} w(x_i,y_i)\,c^{t-1}(x_i,y_i)\,D^{t-1}(x_i,y_i)}{\sum_{(x_i,y_i) \in N(x,y)} w(x_i,y_i)\,c^{t-1}(x_i,y_i)} $$
(5)
$$ c^{t}(x,y) = \max_{(x_i,y_i) \in N(x,y)} w(x_i,y_i)\,c^{t-1}(x_i,y_i) $$
(6)

where \( t \) is the iteration index, \( w(x_i,y_i) \) is the same as in Eq. (2), and \( c(x_i,y_i) \) is the confidence value for the disparity propagated from the neighboring point \( (x_i,y_i) \). Note that Eq. (5) is similar to Eq. (1), except for the extra per-pixel confidence term, whose initial value is 1 for the feature points and 1/n for the interpolated points (where n is the iteration at which the disparity was interpolated). The confidence values are then updated according to the texture information. Normally, \( N(x,y) \) is chosen as a 3 × 3 window (Fig. 7).
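One refinement pass per iteration of Eqs. (5) and (6) can be sketched as follows (illustrative NumPy code under our own parameter choices, not the implementation of [11]):

```python
import numpy as np

def refine(intensity, disp, conf, n_iter=5, sigma_c=10.0, sigma_d=2.0, r=1):
    """Global refinement (Eqs. 5-6) over a (2r+1)x(2r+1) neighborhood: disparities
    are re-averaged with bilateral weights scaled by per-pixel confidences, and
    each confidence becomes the best weighted confidence among its neighbors."""
    intensity = intensity.astype(float)
    disp = disp.astype(float)
    h, w = disp.shape
    yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
    w_dist = np.exp(-(yy ** 2 + xx ** 2) / (2 * sigma_d ** 2))
    for _ in range(n_iter):
        new_disp, new_conf = disp.copy(), conf.copy()
        for y in range(h):
            for x in range(w):
                y0, y1 = max(0, y - r), min(h, y + r + 1)
                x0, x1 = max(0, x - r), min(w, x + r + 1)
                w_col = np.exp(-(intensity[y0:y1, x0:x1] - intensity[y, x]) ** 2
                               / (2 * sigma_c ** 2))
                wgt = w_col * w_dist[y0 - y + r:y1 - y + r, x0 - x + r:x1 - x + r]
                wc = wgt * conf[y0:y1, x0:x1]        # bilateral weight x confidence
                new_disp[y, x] = (wc * disp[y0:y1, x0:x1]).sum() / wc.sum()  # Eq. (5)
                new_conf[y, x] = wc.max()                                    # Eq. (6)
        disp, conf = new_disp, new_conf
    return disp, conf
```

High-confidence seed disparities thus dominate the averages, while low-confidence interpolated values are progressively pulled toward them along color-similar paths.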

Fig. 7. Disparity map after global refinement.

4 Experimental Results

The stereo endoscopic images are captured with the KARL STORZ 3D System, as shown in Fig. 8. Three image sets are captured from a phantom that simulates an MIS environment, and the fourth is captured from a living pig. The experiments are conducted on an Intel(R) Core(TM) i7-4790 3.60 GHz CPU with 16 GB of memory, using the Visual Studio C/C++ 2010 development platform.

Fig. 8. The KARL STORZ 3D endoscope system.

The experiments use the following parameters: (1) minimum/maximum disparity = 0/300, (2) grid sampling = 8 × 8, (3) matching window size for Canny-edge and Canny-edge-dilated points = 11 × 11, (4) matching window size for grid points = 31 × 31. The reason that Canny-edge, Canny-edge-dilated, and grid points adopt different window sizes comes from the following considerations: (1) Canny-edge and Canny-edge-dilated points may be located near depth boundaries, where occlusion often degrades the accuracy of block matching, so the window size is reduced accordingly; (2) grid points often lie in regions with less texture, so the window size is increased accordingly.

Figure 9 shows the matching results between the left and right views (for the first image set) for the four kinds of feature points. The feature points are drawn in different colors for easier discrimination. The numbers of matched pairs are 231, 18181, 11924, and 4996, respectively.

Fig. 9. Results of feature point matching between the left and right views.

We compare several methods in Fig. 10: (b) traditional block matching (BM), (c) the cross-based method [19], (d) semi-global matching [13], and (e) the proposed method. In Fig. 10(b) and (d), red pixels mark disparity estimates that are considered unreliable. As expected, BM leads to noisy results. It is, however, surprising that the well-known r-sgm method [13] fails to estimate reliable disparities for endoscopic images: over 80% of the area is considered unreliable and colored red. The popular cross-based matching method [19] also fails to estimate accurate boundaries for the instrument. Our proposed algorithm, though there is still room for improvement, is capable of detecting the shape of the instrument.

Fig. 10. Experimental results for endoscopic images. (a) Original images, (b) results of block matching, (c) results of [19], (d) results of [13], (e) results of our proposed algorithm. (Color figure online)

Table 2 lists the numbers of original and sifted feature points and their corresponding proportions for each test image set (the original image size is 1620 × 540 pixels). The average proportions before and after sifting are 7.065% and 3.15%, respectively, which are small enough to help reduce the overall computing time.

Table 2. Numbers and proportions of the feature points for 4 test image sets.

5 Conclusion

Due to the lack of distinct texture, it is difficult to estimate a correct and reliable disparity map for stereo endoscopic images. This paper proposes a strategy of estimating disparities for a sparse set of feature points, which are then spatially propagated to the remaining pixels to form a dense disparity map. The sparse set of feature points comprises SIFT, Canny-edge, Canny-edge-dilated, and grid points. Their estimated disparities are then verified (sifted) to constitute the seeds for spatial propagation. Bilateral filtering is first adopted for disparity interpolation, and the results are then refined with information propagated from spatial neighbors.

Experiments show that several well-known algorithms fail on our endoscopic image sets, while our algorithm successfully detects the shapes of the instruments. Disparity map estimation for endoscopic images remains an open research issue, and our preliminary results still leave room for improvement.

Future work will focus on selecting prominent feature points that are more suitable for reliable disparity estimation and propagation.